Search Results - Inner Mongolia University Library

Direct and indirect reinforcement learning

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2021, Vol. 36, No. 8, pp. 4439-4467

Authors: Guan, Yang; Li, Shengbo Eben; Duan, Jingliang; Li, Jie; Ren, Yangang; Sun, Qi; Cheng, Bo. Affiliation: Tsinghua Univ, Sch Vehicle & Mobil, Beijing 100084, Peoples R China

Reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision-making and control tasks. In this paper, we classify RL into direct and indirect RL according to how they seek the optimal policy of the Markov decision process problem. The former solves the optimal policy by directly maximizing an objective function using gradient descent methods, in which the objective function is usually the expectation of accumulative future rewards. The latter indirectly finds the optimal policy by solving the Bellman equation, which is the sufficient and necessary condition from Bellman's principle of optimality. We study policy gradient (PG) forms of direct and indirect RL and show that both of them can derive the actor-critic architecture and can be unified into a PG with the approximate value function and the stationary state distribution, revealing the equivalence of direct and indirect RL. We employ a Gridworld task to verify the influence of different forms of PG, suggesting their differences and relationships experimentally. Finally, we classify current mainstream RL algorithms using the direct and indirect taxonomy, together with other ones, including value-based and policy-based, model-based and model-free.

Keywords: actor-critic; approximate dynamic programming; direct method; indirect method; reinforcement learning
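
The direct/indirect distinction drawn in the abstract above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the paper: the two-state MDP, its rewards, the function names, and the choice of a finite-difference gradient for the direct method (the paper studies policy gradient forms, which are not reproduced here). The indirect route solves the Bellman equation by value iteration; the direct route performs gradient ascent on the expected discounted return of a softmax policy.

```python
import math

GAMMA = 0.9
# Hypothetical 2-state, 2-action deterministic MDP (illustrative only).
NEXT = [[0, 1], [0, 1]]             # NEXT[s][a]: successor state
REWARD = [[0.0, 1.0], [0.0, 2.0]]   # REWARD[s][a]: immediate reward


def value_iteration(sweeps=500):
    """Indirect method: iterate the Bellman optimality equation to a fixed point."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        v = [max(REWARD[s][a] + GAMMA * v[NEXT[s][a]] for a in (0, 1))
             for s in (0, 1)]
    greedy = [max((0, 1), key=lambda a: REWARD[s][a] + GAMMA * v[NEXT[s][a]])
              for s in (0, 1)]
    return v, greedy


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def objective(theta, sweeps=300):
    """Expected discounted return from state 0 under the softmax policy theta."""
    pi = [softmax(theta[s]) for s in (0, 1)]
    v = [0.0, 0.0]
    for _ in range(sweeps):
        v = [sum(pi[s][a] * (REWARD[s][a] + GAMMA * v[NEXT[s][a]]) for a in (0, 1))
             for s in (0, 1)]
    return v[0]


def gradient_ascent(steps=100, lr=0.5, eps=1e-4):
    """Direct method: ascend the objective using central finite differences."""
    theta = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        grad = [[0.0, 0.0], [0.0, 0.0]]
        for s in (0, 1):
            for a in (0, 1):
                theta[s][a] += eps
                up = objective(theta)
                theta[s][a] -= 2 * eps
                down = objective(theta)
                theta[s][a] += eps          # restore the parameter
                grad[s][a] = (up - down) / (2 * eps)
        for s in (0, 1):
            for a in (0, 1):
                theta[s][a] += lr * grad[s][a]
    greedy = [max((0, 1), key=lambda a: theta[s][a]) for s in (0, 1)]
    return theta, greedy


if __name__ == "__main__":
    v_star, indirect_policy = value_iteration()
    theta, direct_policy = gradient_ascent()
    print("indirect (Bellman):", indirect_policy, v_star)
    print("direct (gradient ascent):", direct_policy, objective(theta))
```

On this toy problem both routes select the same action in every state, which loosely mirrors the equivalence the abstract describes; it is not evidence for the paper's general result.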

Robust Reinforcement Learning with Diffusion Wavelets

Author: Seyedmazloom, Ali. Affiliation: George Mason University

Degree: Ph.D., Doctor of Philosophy

Reinforcement Learning is a method of learning from the environment by constantly observing it and evaluating its response to a set of actions. Long-term learning of a dynamic system through such interactions, where the system’s constituents evolve over time, accumulates valuable knowledge about the system’s history and the actions taken. Reinforcement Learning (RL) takes advantage of this gathered knowledge to identify an optimal policy that dictates which decision(s) should be made when the system is in a certain state in order to guide it toward the best objective value in the long run. Since the system’s objective value is the accumulated discounted reward over the system states visited, it is crucial to define and retrieve rewards from the environment as faithfully as possible. However, there are several instances where we might receive noisy, corrupt, or even intentionally perturbed state perceptions that contain distorted rewards. In robotics, for example, it is quite common to receive noisy feedback through faulty sensors or an adverse environment (rain, wind, lack of light, etc.). The focus of this research is to study and utilize methods to properly handle intentionally corrupted discrete-state perceptions and their associated reward channel, which are crafted to increase the likelihood of the model taking actions that serve an unknown adversary’s objectives rather than its own. This may seriously compromise an RL model’s functionality, and addressing it is of utmost importance in the Secure Machine Learning domain, as has been the case with Artificial Neural Network models targeted by “Adversarial Attacks”. A value-based RL model needs to assess all possible decisions based on the quality of the upcoming states, and these quality values (known as q-values) should be obtained from a robust process. While many state-of-the-art RL models use Artificial Neural Networks for providing state values to the RL agent, in this researc

Keywords: approximate dynamic programming; Diffusion wavelets; Operations research; Reinforcement learning; Robust learning

A Stochastic Spatiotemporal Decomposition Decision-Making Approach for Real-Time Dynamic Energy Management of Multi-Microgrids

IEEE TRANSACTIONS ON SUSTAINABLE ENERGY, 2021, Vol. 12, No. 2, pp. 821-833

Authors: Mo, Xiemin; Zhu, Jianquan; Chen, Jiajun; Guo, Ye; Xia, Yunrui; Liu, Mingbo. Affiliation: South China Univ Technol, Sch Elect Power Engn, Guangzhou 510640, Peoples R China

This paper studies the real-time dynamic energy management (DEM) of multi microgrids (MMGs) considering active and reactive power flow constraints, voltage constraints, battery operational characters, and uncertainties in the renewable generation and load. A stochastic spatiotemporal decomposition decision-making framework is proposed based on approximate dynamic programming (ADP) to make decentralized decisions in both spatial and temporal dimensions. The tie-line power and the state of charge (SOC) of the battery are coordinated in real time to deal with the uncertainties in the upcoming future, preserving the decision independence of MGs and periods. And the shift factors are derived to consider active and reactive power flow constraints of MMGs in the stochastic spatiotemporal decomposition framework. Moreover, the historical information is utilized offline to avoid the dependency on forecast information and iterative calculation, while near-optimal solutions can be obtained in real time. Case studies on several MMG test systems and a real MMG system demonstrate the effectiveness of the proposed approach.

Keywords: Spatiotemporal phenomena; Batteries; State of charge; Reactive power; Real-time systems; Load flow; Decision making; Multi microgrids; spatiotemporal decomposition; approximate dynamic programming; uncertainty; power flow constraint; shift factor method

Adaptive Dynamic Programming for Control: A Survey and Recent Advances

IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2021, Vol. 51, No. 1, pp. 142-160

Authors: Liu, Derong; Xue, Shan; Zhao, Bo; Luo, Biao; Wei, Qinglai. Affiliations: Guangdong Univ Technol, Sch Automat, Guangzhou 510006, Peoples R China; South China Univ Technol, Sch Comp Sci & Engn, Guangzhou 510006, Peoples R China; Beijing Normal Univ, Sch Syst Sci, Beijing 100875, Peoples R China; Cent South Univ, Sch Automat, Changsha 410083, Peoples R China; Peng Cheng Lab, Shenzhen 518000, Peoples R China; Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China; Univ Chinese Acad Sci, Beijing 100049, Peoples R China

This article reviews the recent development of adaptive dynamic programming (ADP) with applications in control. First, its applications in optimal regulation are introduced, and some skilled and efficient algorithms are presented. Next, the use of ADP to solve game problems, mainly nonzero-sum game problems, is elaborated. It is followed by applications in large-scale systems. Note that although the functions presented in this article are based on continuous-time systems, various applications of ADP in discrete-time systems are also analyzed. Moreover, in each section, not only some existing techniques are discussed, but also possible directions for future work are pointed out. Finally, some overall prospects for the future are given, followed by conclusions of this article. Through a comprehensive and complete investigation of its applications in many existing fields, this article fully demonstrates that the ADP intelligent control method is promising in today's artificial intelligence era. Furthermore, it also plays a significant role in promoting economic and social development.

Keywords: Adaptive critic designs (ACDs); adaptive dynamic programming; approximate dynamic programming; intelligent control; learning control; neural dynamic programming; neuro-dynamic programming; optimal control; reinforcement learning (RL)

Electric Vehicle Routing with Public Charging Stations

TRANSPORTATION SCIENCE, 2021, Vol. 55, No. 3, pp. 637-659

Authors: Kullman, Nicholas D.; Goodson, Justin C.; Mendoza, Jorge E. Affiliations: Univ Tours, CNRS, LIFAT EA 6300, ROOT ERL CNRS 7002, F-37200 Tours, France; St Louis Univ, Richard A Chaifetz Sch Business, St Louis, MO 63103, USA; HEC Montreal, Montreal, PQ H3T 2A7, Canada; Ctr Interuniv Rech Reseaux Entreprise Logist & Tr, Montreal, PQ H3T 1J4, Canada

We introduce the electric vehicle routing problem with public-private recharging strategy in which vehicles may recharge en route at public charging infrastructure as well as at a privately-owned depot. To hedge against uncertain demand at public charging stations, we design routing policies that anticipate station queue dynamics. We leverage a decomposition to identify good routing policies, including the optimal static policy and fixed-route-based rollout policies that dynamically respond to observed queues. The decomposition also enables us to establish dual bounds, providing a measure of goodness for our routing policies. In computational experiments using real instances from industry, we show the value of our policies to be within 10% of a dual bound. Furthermore, we demonstrate that our policies significantly outperform the industry-standard routing strategy in which vehicle recharging generally occurs at a central depot. Our methods stand to reduce the operating costs associated with electric vehicles, facilitating the transition from internal-combustion engine vehicles.

Keywords: dynamic vehicle routing; electric vehicles; fixed routes; information relaxation; information penalties; approximate dynamic programming

Dynamic Repair Scheduling for Transmission Systems Based on Look-Ahead Strategy Approximation

IEEE TRANSACTIONS ON POWER SYSTEMS, 2021, Vol. 36, No. 4, pp. 2918-2933

Authors: Yan, Jiahao; Hu, Bo; Xie, Kaigui; Niu, Tao; Li, Chunyan; Tai, Heng-Ming. Affiliations: Chongqing Univ, State Key Lab Power Transmiss Equipment & Syst Se, Chongqing 400030, Peoples R China; Univ Tulsa, Dept Elect & Comp Engn, Tulsa, OK 74104, USA

This paper addresses the dynamic repair scheduling of electric power transmission systems based on look-ahead strategy approximation. The objective is to minimize system functionality loss during the restoration stage after disruptive events. A series of decisions regarding which damaged component is to be repaired has to be made successively, considering currently available information on repair time and its uncertainty in the future. To achieve this goal, the dynamic repair scheduling problem is represented as a stochastic Markovian decision process (MDP). To overcome the computational complexity of the MDP arising from its exponentially growing state space, the cost-to-go function is approximated by a look-ahead strategy based on repair importance ordering. Stage-dependent coefficients are used to balance the approximated functionality loss at different decision stages. The tradeoff between efficiency and optimality can be achieved by adjusting the look-ahead depth and the updating policy of the look-ahead strategy. The IEEE 14-bus and 118-bus systems were used for performance evaluation of the proposed method and comparison with various approaches. The results show that it can produce decisions close to the best-known solutions within a small amount of time.

Keywords: Maintenance engineering; Task analysis; Indexes; Power transmission lines; Schedules; Dynamic scheduling; Generators; Transmission restoration; dynamic repair scheduling; look-ahead strategy; approximate dynamic programming; repair importance ordering

Continuous-Time Distributed Policy Iteration for Multicontroller Nonlinear Systems

IEEE TRANSACTIONS ON CYBERNETICS, 2021, Vol. 51, No. 5, pp. 2372-2383

Authors: Wei, Qinglai; Li, Hongyang; Yang, Xiong; He, Haibo. Affiliations: Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China; Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China; Qingdao Acad Intelligent Ind, Qingdao 266109, Peoples R China; Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China; Univ Rhode Isl, Dept Elect Comp & Biomed Engn, Kingston, RI 02881, USA

In this article, a novel distributed policy iteration algorithm is established for infinite horizon optimal control problems of continuous-time nonlinear systems. In each iteration of the developed distributed policy iteration algorithm, only one controller's control law is updated and the other controllers' control laws remain unchanged. The main contribution of the present algorithm is to improve the iterative control laws one by one, instead of updating all the control laws in each iteration as in traditional policy iteration algorithms, which effectively relieves the computational burden in each iteration. The properties of the distributed policy iteration algorithm for continuous-time nonlinear systems are analyzed. The admissibility of the present methods has also been analyzed. Monotonicity, convergence, and optimality have been discussed, which show that the iterative value function is nonincreasingly convergent to the solution of the Hamilton-Jacobi-Bellman equation. Finally, numerical simulations are conducted to illustrate the effectiveness of the proposed method.

Keywords: Optimal control; Nonlinear systems; Decentralized control; Mathematical model; Convergence; Multi-agent systems; Adaptive dynamic programming (ADP); approximate dynamic programming; distributed policy iteration; nonlinear systems; optimal control

Allocating resources via price management systems: a dynamic programming-based approach

INTERNATIONAL JOURNAL OF CONTROL, 2021, Vol. 94, No. 8, pp. 2123-2143

Authors: Forootani, Ali; Liuzza, Davide; Tipaldi, Massimo; Glielmo, Luigi. Affiliations: Univ Sannio, Dept Engn, Piazza Roma 21, I-82100 Benevento, Italy; ENEA, Fus & Nucl Safety Dept, Frascati, Rome, Italy

In this paper, a novel model for price management systems in resource allocation problems is proposed. Stochastic customer requests for resource allocations and releases are modelled as constrained parallel Birth-Death Processes (BDP). We address both instant (i.e. the customer requires a resource to be allocated immediately) and advance (i.e. the customer books a resource for future use) reservation requests, the latter with both bounded and unbounded time interval options. Algorithms based on dynamic programming (DP) principles are proposed for the calculation of suitable price profiles. At the core of such algorithms is the solution of stochastic optimisation problems. In particular, the maximisation of the expected total revenue is formulated via a constrained stochastic dynamic programming (SDP) approach, which becomes time-variant in the case of advance reservation requests. Approximate dynamic programming (ADP) techniques are adopted in the case of large state spaces. Simulations are performed to show the effectiveness of the proposed models and the related algorithms.

Keywords: Price management systems; resource allocation problems; stochastic dynamic programming; Markov decision process; approximate dynamic programming

Modified value-function-approximation for synchronous policy iteration with single-critic configuration for nonlinear optimal control

INTERNATIONAL JOURNAL OF CONTROL, 2021, Vol. 94, No. 5, pp. 1321-1333

Authors: Tang, Difan; Chen, Lei; Tian, Zhao Feng; Hu, Eric. Affiliation: Univ Adelaide, Sch Mech Engn, Adelaide, SA, Australia

This study proposes a modified value-function-approximation (MVFA) and investigates its use under a single-critic configuration based on neural networks (NNs) for synchronous policy iteration (SPI) to deliver a compact implementation of online optimal control synthesis for control-affine continuous-time nonlinear systems. Existing single-critic algorithms require stabilising critic tuning laws while eliminating actor tuning. This paper thus studies an alternative single-critic realisation aiming to relax the need for stabilising mechanisms in the critic tuning law. Optimal control laws are determined from the Hamilton-Jacobi-Bellman equality by solving for the associated value function via SPI in a single-critic configuration. Different from other existing single-critic methods, an MVFA is proposed to deal with closed-loop stability during online learning. Gradient-descent tuning is employed to adjust the critic NN parameters in the interests of not complicating the problem. Parameter convergence and closed-loop system state stability are examined. The proposed MVFA approach yields an alternative single-critic SPI method with uniformly ultimately bounded NN parameter convergence and asymptotic closed-loop system state stability throughout the process of online learning, without the need for stabilising mechanisms in the tuning law for the critic NN. The proposed approach is verified via simulations.

Keywords: Adaptive dynamic programming; approximate dynamic programming; neural networks; nonlinear control; optimal control; policy iteration

Nonsmooth Data-Based Reinforcement Learning for Online Approximate Optimal Control

Author: Greene, Max Lewis. Affiliation: University of Florida

Degree: Ph.D., Doctor of Philosophy

Autonomous systems are often constrained by time-critical mission requirements and limited power. Such constraints motivate optimality in mission execution. Reinforcement learning (RL) has become a tool to facilitate online learning of optimal control policies that achieve a desired objective. Approximate dynamic programming (ADP) is an RL-based technique that generates a forward-in-time approximation of the optimal value function (and in turn the control policy) for dynamical systems with continuous state and action spaces. Developments in regional model-based RL (R-MBRL) facilitate improved online approximation of the value function. R-MBRL approximates the value function over a compact set of the state space and facilitates learning by approximating and evaluating the optimal value function at multiple points on this compact set. This dissertation investigates numerous modifications to R-MBRL ADP to improve computational efficiency, to extend its application to a broader class of dynamical systems, and to incorporate different function approximation techniques. These modifications introduce discontinuities into the otherwise smooth signals, which are analyzed via Lyapunov-based techniques. Chapter 3 presents a technique to reduce the computational expense of performing R-MBRL across an arbitrarily large number of points in the state space. Without modification, existing R-MBRL algorithms evaluate the quality of the value function approximation at many user-defined points in the state space using a conventional neural network (NN). The method presented in Chapter 3 improves on the existing techniques by segmenting the state space and using sparse neural networks (SNNs) to facilitate learning. By segmenting the state space, the cognitive agent can switch between different subsets of the state space over which to evaluate the optimal policy. Furthermore, using an SNN reduces the overall number of operations needed to evaluate the optimal policy. Combined, th

Keywords: Reinforcement learning; Autonomous systems; approximate dynamic programming; Barrier function transformations; Switched systems
