Search Results - Inner Mongolia University Library

Integral Reinforcement Learning for Linear Continuous-Time Zero-Sum Games With Completely Unknown Dynamics

IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2014, Vol. 11, No. 3, pp. 706-714

Authors: Li, Hongliang; Liu, Derong; Wang, Ding. Affiliation: Chinese Acad Sci Inst Automat State Key Lab Management & Control Complex Syst Beijing 100190 Peoples R China

In this paper, we develop an integral reinforcement learning algorithm based on policy iteration to learn online the Nash equilibrium solution for a two-player zero-sum differential game with completely unknown linear continuous-time dynamics. This algorithm is a fully model-free method solving the game algebraic Riccati equation forward in time. The developed algorithm updates the value function and the control and disturbance policies simultaneously. The convergence of the algorithm is demonstrated to be equivalent to Newton's method. To implement this algorithm, one critic network and two action networks are used to approximate the game value function, control and disturbance policies, respectively, and the least squares method is used to estimate the unknown parameters. The effectiveness of the developed scheme is demonstrated in the simulation by designing an H-infinity state feedback controller for a power system.

Note to Practitioners: Noncooperative zero-sum differential games provide an ideal tool to study multiplayer optimal decision and control problems. Existing approaches usually solve for the Nash equilibrium solution by means of offline iterative computation, and require exact knowledge of the system dynamics. However, it is difficult to obtain exact knowledge of the system dynamics for many real-world industrial systems. The algorithm developed in this paper is a fully model-free method which solves the zero-sum differential game problem forward in time by making use of online measured data. This method is not affected by errors between an identification model and a real system, and responds fast to changes of the system dynamics. Exploration signals are required to satisfy the persistence of excitation condition to update the value function and the policies, and these signals do not affect the convergence of the learning process. The least squares method is used to obtain the approximate solution for the zero-sum games with unknown dynamics. The developed a

Keywords: adaptive critic designs; adaptive dynamic programming; approximate dynamic programming; reinforcement learning; policy iteration; zero-sum games
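
The abstract above states that the simultaneous policy iteration is equivalent to Newton's method on the game algebraic Riccati equation. The following is a minimal sketch of that equivalence for a scalar system only. It is model-based (it uses the dynamics directly), so it is not the paper's model-free algorithm, and every symbol in it (a, b, d, q, r, gamma, K, L) is an assumption chosen for illustration because the listing gives no equations.

```python
import math


def game_policy_iteration(a, b, d, q, r, gamma, iters=20):
    """Simultaneous policy iteration for the assumed scalar zero-sum game
    dx/dt = a*x + b*u + d*w with running cost q*x^2 + r*u^2 - gamma^2*w^2.
    Policies are u = -K*x and w = L*x. Returns the list of value iterates p."""
    K, L = 0.0, 0.0  # zero initial policies; a < 0 makes them stabilizing
    history = []
    for _ in range(iters):
        a_cl = a - b * K + d * L  # closed-loop pole under the current policies
        if a_cl >= 0:
            raise ValueError("closed loop is not stable under current policies")
        # Policy evaluation: scalar Lyapunov equation
        # 2*a_cl*p + q + r*K^2 - gamma^2*L^2 = 0
        p = (q + r * K**2 - gamma**2 * L**2) / (-2.0 * a_cl)
        history.append(p)
        # Policy improvement for both players at the same time
        K = b * p / r
        L = d * p / gamma**2
    return history


def newton_step(p, a, b, d, q, r, gamma):
    """One Newton step on f(p) = 2*a*p + q - s*p^2, s = b^2/r - d^2/gamma^2."""
    s = b**2 / r - d**2 / gamma**2
    f = 2 * a * p + q - s * p**2
    return p - f / (2 * a - 2 * s * p)


def riccati_root(a, b, d, q, r, gamma):
    """Closed-form positive root of the scalar game Riccati equation (s > 0)."""
    s = b**2 / r - d**2 / gamma**2
    return (a + math.sqrt(a * a + s * q)) / s
```

In this scalar setting, substituting the improved gains into the Lyapunov equation gives p_next = (q + s*p^2) / (2*s*p - 2*a), which is algebraically the same as the Newton update, so consecutive entries of `game_policy_iteration(...)` should match `newton_step` applied to the previous entry.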

Learning-Based Adaptive Optimal Control for Connected Vehicles in Mixed Traffic: Robustness to Driver Reaction Time

IEEE TRANSACTIONS ON CYBERNETICS, 2022, Vol. 52, No. 6, pp. 5267-5277

Authors: Huang, Mengzhe; Jiang, Zhong-Ping; Ozbay, Kaan. Affiliations: NYU Tandon Sch Engn Dept Elect & Comp Engn Control & Networks Lab Brooklyn NY 11201 USA; NYU Tandon Sch Engn C2SMART Ctr Brooklyn NY 11201 USA

Through vehicle-to-vehicle (V2V) communication, both human-driven and autonomous vehicles can actively exchange data, such as velocities and bumper-to-bumper distances. Employing the shared data, control laws with improved performance can be designed for connected and autonomous vehicles (CAVs). In this article, taking into account human-vehicle interaction and heterogeneous driver behavior, an adaptive optimal control design method is proposed for a platoon mixed with multiple preceding human-driven vehicles and one CAV at the tail. It is shown that by using reinforcement learning and adaptive dynamic programming techniques, a near-optimal controller can be learned from real-time data for the CAV with V2V communications, but without precise knowledge of the car-following parameters of any driver in the platoon. The proposed method allows the CAV controller to adapt to different platoon dynamics caused by the unknown and heterogeneous driver-dependent parameters. To improve the safety performance during the learning process, our off-policy learning algorithm can leverage both the historical data and the data collected in real time, which leads to considerably reduced learning time. The effectiveness and efficiency of our proposed method are demonstrated by rigorous proofs and microscopic traffic simulations.

Keywords: Vehicles; Mathematical model; Delays; Vehicle dynamics; Autonomous vehicles; Real-time systems; Safety; adaptive dynamic programming (ADP); connected vehicles; time-delay system

An Effective PQ-Decoupling Control Scheme Using Adaptive Dynamic Programming Approach to Reducing Oscillations of Virtual Synchronous Generators for Grid Connection With Different Impedance Types

IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2024, Vol. 71, No. 4, pp. 3763-3775

Authors: Wang, Zhongyang; Wang, Youqing; Davari, Masoud; Blaabjerg, Frede. Affiliations: Beijing Univ Chem Technol Coll Informat Sci & Technol Beijing 100029 Peoples R China; Georgia Southern Univ Statesboro Dept Elect & Comp Engn Statesboro GA 30460 USA; Aalborg Univ AAU Energy Dept DK-9220 Aalborg Denmark

The power coupling of the virtual synchronous generator (VSG) in the grid-connected mode may aggravate power oscillation because of a resistance-inductive line. In order to deal with this issue, this research study proposes an adaptive and optimal approach to controlling VSG via reinforcement learning and adaptive dynamic programming (ADP). It derives the linear and nonlinear hybrid equations of the VSG power considering the case where the line impedance is uncertain. The nonlinear part is a disturbance, and the linear ADP solves the optimal feedback control and compensation controller, eliminating the interaction between the active power and reactive power. Also, the proposed method utilizes value iteration and is data-driven. Thus, it does not rely on an initial stability control gain and an accurate dynamic model during the learning process. Comparative experiments reveal the effectiveness of the proposed method and validate the practicability of the methodology introduced; in addition, comparative simulations present the superiority of the proposed method in power systems based on synchronous generators.

Keywords: adaptive dynamic programming (ADP); coupling between active power and reactive power; linear-quadratic regulator (LQR); optimal feedback controller; value iteration (VI); virtual synchronous generator (VSG)

Reinforcement learning in multidimensional continuous action spaces

IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning

Authors: Pazis, Jason; Lagoudakis, Michail G. Affiliations: Department of Computer Science Duke University Durham NC 27708-0129 United States; Department of Electronic and Computer Engineering Technical University of Crete Chania Crete 73100 Greece

ISBN: (print) 9781424498888

The majority of learning algorithms available today focus on approximating the state (V) or state-action (Q) value function and efficient action selection comes as an afterthought. On the other hand, real-world problems tend to have large action spaces, where evaluating every possible action becomes impractical. This mismatch presents a major obstacle in successfully applying reinforcement learning to real-world problems. In this paper we present an effective approach to learning and acting in domains with multidimensional and/or continuous control variables where efficient action selection is embedded in the learning process. Instead of learning and representing the state or state-action value function of the MDP, we learn a value function over an implied augmented MDP, where states represent collections of actions in the original MDP and transitions represent choices eliminating parts of the action space at each step. Action selection in the original MDP is reduced to a binary search by the agent in the transformed MDP, with computational complexity logarithmic in the number of actions, or equivalently linear in the number of action dimensions. Our method can be combined with any discrete-action reinforcement learning algorithm for learning multidimensional continuous-action policies using a state value approximator in the transformed MDP. Our preliminary results with two well-known reinforcement learning algorithms (Least-Squares Policy Iteration and Fitted Q-Iteration) on two continuous action domains (1-dimensional inverted pendulum regulator, 2-dimensional bicycle balancing) demonstrate the viability and the potential of the proposed approach. © 2011 IEEE.

Keywords: reinforcement learning
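
The abstract above describes action selection as a binary search whose cost is logarithmic in the number of actions. The following is a minimal one-dimensional sketch of that idea, not the authors' implementation. The function `prefer_increase` is a hypothetical stand-in for whatever learned binary policy operates in the transformed MDP, and the parameter names (`a_min`, `a_max`, `resolution_bits`) are assumptions made for illustration.

```python
from typing import Callable


def binary_action_search(
    state: float,
    prefer_increase: Callable[[float, float], bool],
    a_min: float,
    a_max: float,
    resolution_bits: int = 10,
) -> float:
    """Pick a continuous action using a fixed number of binary decisions.

    prefer_increase(state, action) is a hypothetical learned binary policy
    that says whether the current candidate action should be raised.
    """
    action = 0.5 * (a_min + a_max)   # start in the middle of the range
    step = 0.25 * (a_max - a_min)    # first adjustment is a quarter of the range
    for _ in range(resolution_bits):
        if prefer_increase(state, action):
            action += step
        else:
            action -= step
        step *= 0.5                  # halve the step after every decision
    return action
```

With `resolution_bits` decisions the search distinguishes on the order of 2**resolution_bits actions, which is where the logarithmic cost comes from. For several action dimensions, one plausible extension is to run the same loop once per dimension, giving a cost linear in the number of dimensions.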

The knowledge gradient policy for offline learning with independent normal rewards

IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning

Authors: Frazier, Peter; Powell, Warren. Affiliation: Princeton Univ Dept Operat Res & Financial Engn Princeton NJ 08544 USA

We define a new type of policy, the knowledge gradient policy, in the context of an offline learning problem. We show how to compute the knowledge gradient policy efficiently and demonstrate through Monte Carlo simula...

ISBN: (print) 9781424407064

Keywords: learning systems
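
The truncated abstract above says the knowledge gradient policy can be computed efficiently but gives no formula. The following is a minimal sketch under the assumption that the commonly cited closed form for independent normal beliefs with known measurement-noise variance applies. The names `mu`, `sigma2`, and `noise_var` are hypothetical and do not come from the listing.

```python
import math


def _normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def knowledge_gradient(mu, sigma2, noise_var):
    """Knowledge-gradient value of measuring each alternative once.

    mu[i] and sigma2[i] are the current posterior mean and variance of
    alternative i; noise_var is the assumed known measurement variance.
    Requires at least two alternatives.
    """
    values = []
    for i in range(len(mu)):
        # Standard deviation of the change in the posterior mean of i
        sigma_tilde = sigma2[i] / math.sqrt(sigma2[i] + noise_var)
        if sigma_tilde == 0.0:
            values.append(0.0)  # nothing left to learn about this alternative
            continue
        best_other = max(m for j, m in enumerate(mu) if j != i)
        zeta = -abs(mu[i] - best_other) / sigma_tilde
        values.append(sigma_tilde * (zeta * _normal_cdf(zeta) + _normal_pdf(zeta)))
    return values


def kg_choice(mu, sigma2, noise_var):
    """Index of the alternative with the largest knowledge-gradient value."""
    values = knowledge_gradient(mu, sigma2, noise_var)
    return max(range(len(values)), key=values.__getitem__)
```

Under these assumptions the policy measures whichever alternative has the largest value, trading off how uncertain an alternative is against how far its mean sits from the best competitor.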

GrDHP: A General Utility Function Representation for Dual Heuristic Dynamic Programming

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2015, Vol. 26, No. 3, pp. 614-627

Authors: Ni, Zhen; He, Haibo; Zhao, Dongbin; Xu, Xin; Prokhorov, Danil V. Affiliations: Univ Rhode Isl Dept Elect Comp & Biomed Engn Kingston RI 02881 USA; Chinese Acad Sci Inst Automat State Key Lab Management & Control Complex Syst Beijing 100190 Peoples R China; Natl Univ Def Technol Coll Mechatron & Automat Changsha 410073 Hunan Peoples R China; Toyota Res Inst NA Toyota Tech Ctr Ann Arbor MI 48105 USA

A general utility function representation is proposed to provide the required derivable and adjustable utility function for the dual heuristic dynamic programming (DHP) design. Goal representation DHP (GrDHP) is presented with a goal network being on top of the traditional DHP design. This goal network provides a general mapping between the system states and the derivatives of the utility function. With this proposed architecture, we can obtain the required derivatives of the utility function directly from the goal network. In addition, instead of a fixed predefined utility function in literature, we conduct an online learning process for the goal network so that the derivatives of the utility function can be adaptively tuned over time. We provide the control performance of both the proposed GrDHP and the traditional DHP approaches under the same environment and parameter settings. The statistical simulation results and the snapshot of the system variables are presented to demonstrate the improved learning and controlling performance. We also apply both approaches to a power system example to further demonstrate the control capabilities of the GrDHP approach.

Keywords: adaptive control; adaptive dynamic programming (ADP); dual heuristic dynamic programming (DHP); general utility function; goal representation; reinforcement learning (RL)

Multi-Objective Reinforcement Learning for AUV Thruster Failure Recovery

IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)

Authors: Ahmadzadeh, Seyed Reza; Kormushev, Petar; Caldwell, Darwin G. Affiliation: Ist Italiano Tecnol Dept Adv Robot Via Morego 30 I-16163 Genoa Italy

ISBN: (print) 9781479945528

This paper investigates learning approaches for discovering fault-tolerant control policies to overcome thruster failures in Autonomous Underwater Vehicles (AUV). The proposed approach is a model-based direct policy search that learns on an on-board simulated model of the vehicle. When a fault is detected and isolated the model of the AUV is reconfigured according to the new condition. To discover a set of optimal solutions a multi-objective reinforcement learning approach is employed which can deal with multiple conflicting objectives. Each optimal solution can be used to generate a trajectory that is able to navigate the AUV towards a specified target while satisfying multiple objectives. The discovered policies are executed on the robot in a closed-loop using AUV's state feedback. Unlike most existing methods which disregard the faulty thruster, our approach can also deal with partially broken thrusters to increase the persistent autonomy of the AUV. In addition, the proposed approach is applicable when the AUV either becomes under-actuated or remains redundant in the presence of a fault. We validate the proposed approach on the model of the Girona500 AUV.

Keywords: autonomous underwater vehicles; closed loop systems; control engineering computing; fault diagnosis; learning (artificial intelligence); mobile robots; optimal control; state feedback; AUV state feedback; AUV thruster failure recovery; Girona500 AUV; closed-loop; conflicting objective; fault detection; fault-tolerant control policy; faulty thruster; model-based direct policy search; multiobjective reinforcement learning approach; on-board simulated model; optimal solution; Optimization; Sociology; Statistics; Trajectory; Vectors; Vehicle dynamics; Vehicles; Defect detection

Active exploration for robot parameter selection in episodic reinforcement learning

IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning

Authors: Kroemer, Oliver; Peters, Jan. Affiliation: Max Planck Institute 38 Spemannstr. Tuebingen 72012 Germany

ISBN: (print) 9781424498888

As the complexity of robots and other autonomous systems increases, it becomes more important that these systems can adapt and optimize their settings actively. However, such optimization is rarely trivial. Sampling from the system is often expensive in terms of time and other costs, and excessive sampling should therefore be avoided. The parameter space is also usually continuous and multi-dimensional. Given the inherent exploration-exploitation dilemma of the problem, we propose treating it as an episodic reinforcement learning problem. In this reinforcement learning framework, the policy is defined by the system's parameters and the rewards are given by the system's performance. The rewards accumulate during each episode of a task. In this paper, we present a method for efficiently sampling and optimizing in continuous multidimensional spaces. The approach is based on Gaussian process regression, which can represent continuous non-linear mappings from parameters to system performance. We employ an upper confidence bound policy, which explicitly manages the trade-off between exploration and exploitation. Unlike many other policies for this kind of problem, we do not rely on a discretization of the action space. The presented method was evaluated on a real robot. The robot had to learn grasping parameters in order to adapt its grasping execution to different objects. The proposed method was also tested on a more general gain tuning problem. The results of the experiments show that the presented method can quickly determine suitable parameters and is applicable to real online learning applications. © 2011 IEEE.

Keywords: reinforcement learning

Asymptotically Stable Adaptive-Optimal Control Algorithm With Saturating Actuators and Relaxed Persistence of Excitation

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2016, Vol. 27, No. 11, pp. 2386-2398

Authors: Vamvoudakis, Kyriakos G.; Miranda, Marcio Fantini; Hespanha, Joao P. Affiliations: Univ Calif Santa Barbara Ctr Control Dynam Syst & Computat Santa Barbara CA 93106 USA; Univ Fed Minas Gerais Colegio Tecn BR-31270901 Belo Horizonte MG Brazil

This paper proposes a control algorithm based on adaptive dynamic programming to solve the infinite-horizon optimal control problem for known deterministic nonlinear systems with saturating actuators and nonquadratic cost functionals. The algorithm is based on an actor/critic framework, where a critic neural network (NN) is used to learn the optimal cost, and an actor NN is used to learn the optimal control policy. The adaptive control nature of the algorithm requires a persistence of excitation condition to be a priori validated, but this can be relaxed using previously stored data concurrently with current data in the update of the critic NN. A robustifying control term is added to the controller to eliminate the effect of residual errors, leading to the asymptotic stability of the closed-loop system. Simulation results show the effectiveness of the proposed approach for a controlled Van der Pol oscillator and also for a power system plant.

Keywords: Approximate dynamic programming (ADP); optimal control; reinforcement learning; saturating actuators

Closed-Loop Control of Anesthesia and Mean Arterial Pressure Using Reinforcement Learning

IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)

Authors: Padmanabhan, Regina; Meskin, Nader; Haddad, Wassim M. Affiliations: Qatar Univ Dept Elect Engn Doha Qatar; Georgia Inst Technol Sch Aerosp Engn Atlanta GA 30332 USA

ISBN: (print) 9781479945528

General anesthesia is required for patients undergoing surgery as well as for some patients in the intensive care units with acute respiratory distress syndrome. However, most anesthetics affect cardiac and respiratory functions. Hence, it is important to monitor and control the infusion of anesthetics to meet sedation requirements while keeping patient vital parameters within safe limits. The critical task of anesthesia administration also necessitates that drug dosing be optimal, patient specific, and robust. In this paper, the concept of reinforcement learning (RL) is used to develop a closed-loop anesthesia controller using the bispectral index (BIS) as a control variable while concurrently accounting for mean arterial pressure (MAP). In particular, the proposed framework uses these two parameters to control propofol infusion rates to regulate the BIS and MAP within a desired range. Specifically, a weighted combination of the error of the BIS and MAP signals is considered in the proposed RL algorithm. This reduces the computational complexity of the RL algorithm and consequently the controller processing time.

Keywords: closed loop systems; computational complexity; learning (artificial intelligence); medical computing; medical control systems; surgery; BIS; MAP; RL; acute respiratory distress syndrome; anesthesia administration; anesthetics; bispectral index; closed-loop control; mean arterial pressure; patient surgery; reinforcement learning; Anesthesia; Biomedical monitoring; Blood pressure; Drugs; Indexes; Optimal control; complexity classes; Adult Respiratory Distress Syndrome; manufacturing automation protocol; MITIGATION ACTION PLANS; control ring
