Search results - Inner Mongolia University Library

Transfer reinforcement learning for multi-agent pursuit-evasion differential game with obstacles in a continuous environment

ASIAN JOURNAL OF CONTROL, 2024, Vol. 26, No. 4, pp. 2125-2140

Authors: Hu, Penglin; Pan, Quan; Zhao, Chunhui; Guo, Yaning. Affiliation: Northwestern Polytech Univ, Sch Automat, Xian 710130, Peoples R China

In this paper, we study the multi-pursuer single-evader pursuit-evasion (MSPE) differential game in a continuous environment with the consideration of obstacles. We propose a novel pursuit-evasion algorithm based on reinforcement learning and transfer learning. In the source task learning stage, we employ the Q-learning and value function approximation method to overcome the challenges posed by the large-scale storage space required by the conventional Q-table solution method. This approach expands the discrete space to the continuous space by value function approximation and effectively reduces the demand for storage space. During the target task learning stage, we utilize the Gaussian mixture model (GMM) to classify the source tasks. The source policies whose corresponding state-value sets have the highest probability densities are assigned for the agent in the target task for learning. This methodology not only effectively avoids negative transfer but also enhances the algorithm's generalization ability and convergence speed. Through simulation and experiment, we demonstrate the algorithm's effectiveness.

Keywords: Gaussian mixture model; pursuit-evasion game; Q-learning; recursive least squares; transfer reinforcement learning; value function approximation
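
The abstract above describes replacing a Q-table with value function approximation so that Q-learning works on a continuous state space. The following is a minimal sketch of that general idea only, not the authors' algorithm: the one-dimensional toy environment, the radial-basis features, and all names and hyperparameters are assumptions made for illustration, and a plain semi-gradient update is used instead of the recursive least squares listed in the keywords.

```python
import math
import random

CENTERS = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical RBF centers on [0, 1]
WIDTH = 0.25                            # hypothetical RBF width
N_ACTIONS = 2                           # 0: step left, 1: step right


def features(x):
    """Bias term plus radial-basis features of a continuous state x."""
    return [1.0] + [math.exp(-((x - c) / WIDTH) ** 2) for c in CENTERS]


def q_value(w, x, a):
    """Linear approximation Q(x, a) = w[a] . features(x)."""
    return sum(wi * fi for wi, fi in zip(w[a], features(x)))


def greedy_action(w, x):
    return max(range(N_ACTIONS), key=lambda b: q_value(w, x, b))


def step(x, a):
    """Toy dynamics (assumed): move 0.1 left or right, goal at x = 1."""
    x_next = min(1.0, max(0.0, x + (0.1 if a == 1 else -0.1)))
    return x_next, -1.0, x_next >= 1.0


def train(episodes=500, alpha=0.05, gamma=0.95, eps=0.2, seed=0):
    rng = random.Random(seed)
    # One weight vector per action replaces the Q-table.
    w = [[0.0] * (len(CENTERS) + 1) for _ in range(N_ACTIONS)]
    for _ in range(episodes):
        x = rng.uniform(0.0, 0.95)
        for _ in range(100):  # cap the episode length
            if rng.random() < eps:
                a = rng.randrange(N_ACTIONS)
            else:
                a = greedy_action(w, x)
            x_next, r, done = step(x, a)
            best_next = max(q_value(w, x_next, b) for b in range(N_ACTIONS))
            target = r if done else r + gamma * best_next
            td_error = target - q_value(w, x, a)
            # Semi-gradient Q-learning update on the chosen action's weights.
            for i, fi in enumerate(features(x)):
                w[a][i] += alpha * td_error * fi
            if done:
                break
            x = x_next
    return w


weights = train()
```

Storage here is 12 weights regardless of how finely the state is resolved, which is the storage saving the abstract refers to; `greedy_action(weights, x)` can be queried for any `x` in [0, 1].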

A generalized energy management framework for hybrid construction vehicles via model-based reinforcement learning

ENERGY, 2022, Vol. 260

Authors: Zhang, Wei; Wang, Jixin; Xu, Zhenyu; Shen, Yuying; Gao, Guangzong. Affiliations: Jilin Univ, Sch Mech & Aerosp Engn, Changchun 130012, Peoples R China; State Key Lab Smart Mfg Special Vehicles & Transm, Baotou 014030, Peoples R China; Huazhong Univ Sci & Technol, State Key Lab Digital Mfg Equipment & Technol, Wuhan, Peoples R China

Hybrid construction vehicles (HCVs) have more specific tasks and highly repetitive patterns than on-road vehicles. Consequently, they are more suitable for model-based energy management. However, distinctions between work cycles result in adverse scenarios for generalizing model-based energy management. In this study, we solve this problem by proposing a generalized strategy using a model-based reinforcement learning framework. The generalized design highlights three aspects: 1) long-term stability, 2) self-learning ability, and 3) state transition model reuse. A reward function with a trend term is proposed to avoid the cumulative errors between operation cycles and improve the long-term stability of learning. In addition, Gaussian process regression is leveraged to approximate the value function, thereby reducing the computational load and improving the learning efficiency. To further enhance the reusability of the environmental model, a modelling method based on the Gaussian mixture model is put forward. Finally, a generalized HCV energy management framework that includes offline and online learning is designed, where a pre-learning model and an approximation function are adopted for reuse and dynamic learning. Simulation results demonstrate the superiority of the proposed framework to conventional model-based methods in terms of stability, generality, and adaptability, accompanied by a reduction of 5.9% in fuel consumption.

Keywords: Hybrid construction vehicle; Energy management; Model-based learning; value function approximation; Gaussian mixture model

Sparse online kernelized actor-critic Learning in reproducing kernel Hilbert space

ARTIFICIAL INTELLIGENCE REVIEW, 2022, Vol. 55, No. 1, pp. 23-58

Authors: Yang, Yongliang; Zhu, Hufei; Zhang, Qichao; Zhao, Bo; Li, Zhenning; Wunsch, Donald C. Affiliations: Univ Sci & Technol Beijing, Sch Automat & Elect Engn, Beijing 100083, Peoples R China; Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China; Univ Chinese Acad Sci, Beijing, Peoples R China; Beijing Normal Univ, Sch Syst Sci, Beijing 100875, Peoples R China; Univ Macau, State Key Lab Internet Things Smart City, Taipa 59193, Macao, Peoples R China; Missouri Univ Sci & Technol, Dept Elect & Comp Engn, Rolla, MO 65401, USA

In this paper, we develop a novel non-parametric online actor-critic reinforcement learning (RL) algorithm to solve optimal regulation problems for a class of continuous-time affine nonlinear dynamical systems. To deal with the value function approximation (VFA) with inherent nonlinear and unknown structure, a reproducing kernel Hilbert space (RKHS)-based kernelized method is designed through online sparsification, where the dictionary size is fixed and consists of updated elements. In addition, the linear independence check condition, i.e., an online criterion, is designed to determine whether the online data should be inserted into the dictionary. The RKHS-based kernelized VFA has a variable structure in accordance with the online data collection, which is different from classical parametric VFA methods with a fixed structure. Furthermore, we develop a sparse online kernelized actor-critic learning RL method to learn the unknown optimal value function and the optimal control policy in an adaptive fashion. The convergence of the presented kernelized actor-critic learning method to the optimum is provided. The boundedness of the closed-loop signals during the online learning phase can be guaranteed. Finally, a simulation example is conducted to demonstrate the effectiveness of the presented kernelized actor-critic learning algorithm.

Keywords: Reproducing kernel Hilbert space; Actor-critic learning; value function approximation; Online sparsification; Non-parametric learning
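
The abstract above mentions a linear independence check that decides whether a new sample enters the kernel dictionary. One common form of such a check is the approximate linear dependence (ALD) test, sketched below under stated assumptions: the Gaussian kernel, its width, the threshold, and all function names are illustrative choices, and the paper's fixed-size dictionary with element replacement is not reproduced.

```python
import math


def gaussian_kernel(x, y, width=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def ald_residual(dictionary, x):
    """Squared distance in feature space from x to the span of the dictionary."""
    if not dictionary:
        return gaussian_kernel(x, x)
    K = [[gaussian_kernel(d1, d2) for d2 in dictionary] for d1 in dictionary]
    k = [gaussian_kernel(d, x) for d in dictionary]
    coeffs = solve(K, k)
    return gaussian_kernel(x, x) - sum(c * ki for c, ki in zip(coeffs, k))


def build_dictionary(samples, threshold=0.1):
    """Keep a sample only if it is not well represented by the current dictionary."""
    dictionary = []
    for x in samples:
        if ald_residual(dictionary, x) > threshold:
            dictionary.append(x)
    return dictionary


samples = [(0.0,), (0.01,), (2.0,), (2.02,), (4.0,)]
dictionary = build_dictionary(samples)
```

In this toy stream the near-duplicates `(0.01,)` and `(2.02,)` have a residual far below the threshold and are skipped, so the dictionary grows only when a sample adds a new direction in feature space.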

Deep reinforcement learning using least-squares truncated temporal-difference

CAAI Transactions on Intelligence Technology, 2024, Vol. 9, No. 2, pp. 425-439

Authors: Junkai Ren; Yixing Lan; Xin Xu; Yichuan Zhang; Qiang Fang; Yujun Zeng. Affiliations: College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China; State Key Laboratory of Astronautic Dynamics, Xi'an Satellite Control Center, Xi'an, China

Policy evaluation (PE) is a critical sub-problem in reinforcement learning, which estimates the value function for a given policy and can be used for policy ***, there still exist some limitations in current PE methods, such as low sample efficiency and local convergence, especially on complex *** this study, a novel PE algorithm called Least-Squares Truncated Temporal-Difference learning (LST2D) is *** LST2D, an adaptive truncation mechanism is designed, which effectively takes advantage of the fast convergence property of Least-Squares Temporal Difference learning and the asymptotic convergence property of Temporal Difference learning (TD). Then, two feature pre-training methods are utilised to improve the approximation ability of ***, an Actor-Critic algorithm based on LST2D and pre-trained feature representations (ACLPF) is proposed, where LST2D is integrated into the critic network to improve learning-prediction *** simulation studies were conducted on four robotic tasks, and the corresponding results illustrate the effectiveness of *** proposed ACLPF algorithm outperformed DQN, ACER and PPO in terms of sample efficiency and stability, which demonstrated that LST2D can be applied to online learning control problems by incorporating it into the actor-critic architecture.

Keywords: Deep reinforcement learning; policy evaluation; temporal difference; value function approximation
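
The abstract above builds on Least-Squares Temporal Difference learning for policy evaluation. As background, here is a minimal sketch of plain LSTD(0) with linear features; it is not LST2D, and the adaptive truncation mechanism is not reproduced. The three-state deterministic chain, the one-hot features, and all names are assumptions chosen so the result can be checked by hand.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def lstd(transitions, phi, n_features, gamma):
    """LSTD(0): accumulate A = sum phi(s) (phi(s) - gamma phi(s'))^T and
    b = sum phi(s) r over the samples, then solve A theta = b."""
    A = [[0.0] * n_features for _ in range(n_features)]
    b = [0.0] * n_features
    for s, r, s_next in transitions:
        f = phi(s)
        # A terminal successor (None) contributes a zero feature vector.
        f_next = phi(s_next) if s_next is not None else [0.0] * n_features
        for i in range(n_features):
            b[i] += f[i] * r
            for j in range(n_features):
                A[i][j] += f[i] * (f[j] - gamma * f_next[j])
    return solve(A, b)


def one_hot(s, n=3):
    return [1.0 if i == s else 0.0 for i in range(n)]


# Hypothetical chain 0 -> 1 -> 2 -> terminal with reward 1 on every step.
transitions = [(0, 1.0, 1), (1, 1.0, 2), (2, 1.0, None)]
theta = lstd(transitions, one_hot, n_features=3, gamma=0.5)
```

With one-hot features the weights are the state values themselves, so `theta` can be compared with the hand-computed returns 1 + 0.5 + 0.25, 1 + 0.5, and 1. Unlike incremental TD, the estimate comes from a single linear solve over the batch, with no step size to tune.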

A Fast Technique for Smart Home Management: ADP With Temporal Difference Learning

IEEE TRANSACTIONS ON SMART GRID, 2018, Vol. 9, No. 4, pp. 3291-3303

Authors: Keerthisinghe, Chanaka; Verbic, Gregor; Chapman, Archie C. Affiliation: Univ Sydney, Sch Elect & Informat Engn, Sydney, NSW 2006, Australia

This paper presents a computationally efficient smart home energy management system (SHEMS) using an approximate dynamic programming (ADP) approach with temporal difference learning for scheduling distributed energy resources. This approach improves the performance of an SHEMS by incorporating stochastic energy consumption and PV generation models over a horizon of several days, using only the computational power of existing smart meters. In this paper, we consider a PV-storage (thermal and battery) system; however, our method can extend to multiple controllable devices without the exponential growth in computation that other methods such as dynamic programming (DP) and stochastic mixed-integer linear programming (MILP) suffer from. Specifically, probability distributions associated with the PV output and demand are kernel estimated from empirical data collected during the Smart Grid Smart City project in NSW, Australia. Our results show that ADP computes a solution much faster than both DP and stochastic MILP, and provides only a slight reduction in quality compared to the optimal DP solution. In addition, incorporating a thermal energy storage unit using the proposed ADP-based SHEMS reduces the daily electricity cost by up to 26.3% without a noticeable increase in the computational burden. Moreover, ADP with a two-day decision horizon reduces the average yearly electricity cost by 4.6% over a daily DP method, yet requires less than half of the computational effort.

Keywords: Demand response; smart home energy management; distributed energy resources; approximate dynamic programming; dynamic programming; stochastic mixed-integer linear programming; value function approximation; temporal difference learning

Value function gradient learning for large-scale multistage stochastic programming problems

EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 2023, Vol. 308, No. 1, pp. 321-335

Authors: Lee, Jinkyu; Bae, Sanghyeon; Kim, Woo Chang; Lee, Yongjae. Affiliations: Korea Adv Inst Sci & Technol KAIST, Dept Ind & Syst Engn, Daejeon, South Korea; Ulsan Natl Inst Sci & Technol UNIST, Dept Ind Engn, Ulsan, South Korea

A stagewise decomposition algorithm called "value function gradient learning" (VFGL) is proposed for large-scale multistage stochastic convex programs. VFGL finds the parameter values that best fit the gradient of the value function within a given parametric family. Widely used decomposition algorithms for multistage stochastic programming, such as stochastic dual dynamic programming (SDDP), approximate the value function by adding linear subgradient cuts at each iteration. Although this approach has been successful for linear problems, nonlinear problems may suffer from the increasing size of each subproblem as the iteration proceeds. On the other hand, VFGL has a fixed number of parameters; thus, the size of the subproblems remains constant throughout the iteration. Furthermore, VFGL can learn the parameters by means of stochastic gradient descent, which means that it can be easily parallelized and does not require a scenario tree approximation of the underlying uncertainties. VFGL was compared with a deterministic equivalent formulation of the multistage stochastic programming problem and SDDP approaches for three illustrative examples: production planning, hydrothermal generation, and the lifetime financial planning problem. Numerical examples show that VFGL generates high-quality solutions and is computationally efficient. (c) 2022 Elsevier B.V. All rights reserved.

Keywords: Decision processes; Large-scale optimization; Multistage stochastic programming; Stagewise decomposition; value function approximation

Adaptive dynamic programming for robust neural control of unknown continuous-time non-linear systems

IET CONTROL THEORY AND APPLICATIONS, 2017, Vol. 11, No. 14, pp. 2307-2316

Authors: Yang, Xiong; He, Haibo; Liu, Derong; Zhu, Yuanheng. Affiliations: Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China; Univ Rhode Isl, Dept Elect Comp & Biomed Engn, Kingston, RI 02881, USA; Guangdong Univ Technol, Sch Automat, Guangzhou 510006, Guangdong, Peoples R China; Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China

The design of robust controllers for continuous-time (CT) non-linear systems with completely unknown non-linearities is a challenging task. The inability to accurately identify the non-linearities online or offline motivates the design of robust controllers using adaptive dynamic programming (ADP). In this study, an ADP-based robust neural control scheme is developed for a class of unknown CT non-linear systems. To begin with, the robust non-linear control problem is converted into a non-linear optimal control problem via constructing a value function for the nominal system. Then an ADP algorithm is developed to solve the non-linear optimal control problem. The ADP algorithm employs actor-critic dual networks to approximate the control policy and the value function, respectively. Based on this architecture, only system data is necessary to update simultaneously the actor neural network (NN) weights and the critic NN weights. Meanwhile, the persistence of excitation assumption is no longer required by using the Monte Carlo integration method. The closed-loop system with unknown non-linearities is demonstrated to be asymptotically stable under the obtained optimal control. Finally, two examples are provided to validate the developed method.

Keywords: dynamic programming; robust control; neurocontrollers; continuous time systems; control system synthesis; nonlinear control systems; optimal control; function approximation; Monte Carlo methods; closed loop systems; asymptotic stability; adaptive dynamic programming; robust neural control design; unknown continuous-time nonlinear systems; CT nonlinear systems; ADP-based robust neural control scheme; robust nonlinear control problem; nonlinear optimal control problem; nominal system; ADP algorithm; actor-critic dual networks; control policy approximation; value function approximation; actor neural network weights; critic NN weights; Monte Carlo integration method; closed-loop system; asymptotically stability

Link Analysis for Solving Multiple-Access MDPs With Large State Spaces

IEEE TRANSACTIONS ON SIGNAL PROCESSING, 2023, Vol. 71, pp. 947-962

Authors: Bozkus, Talha; Mitra, Urbashi. Affiliation: Univ Southern Calif, Ming Hsieh Dept Elect & Comp Engn, Los Angeles, CA 90007, USA

Wireless communication networks can be well-modeled by Markov Decision Processes (MDPs). While traditional dynamic programming algorithms such as value and policy iteration have lower complexity than brute force strategies, they still suffer from complexity issues for large state spaces. In this paper, the development of moderate complexity algorithms with high performance is sought for wireless network control. To this end, the approximate value function can be computed by projecting the original value function into a lower-dimensional subspace with the careful choice of basis vectors using tools from graph signal processing (GSP). Although GSP theory mainly focuses on undirected graphs, the transition graphs of MDPs for wireless networks are generally directed. For this reason, graph symmetrization based on the co-link method is considered due to its ability to preserve multi-hop dependencies in the graph. Given the multiple-access model under consideration, key properties of the transition probability matrix are exploited to determine the optimal subspace for the approximate value function with low complexity. The numerical results for a multiple-access channel show that the proposed algorithm can find the optimal basis vectors with high accuracy, and furthermore, the approach is robust to changes in system parameters. It is also shown that the projected equation method outperforms the state aggregation technique in producing higher accuracy with a lower runtime complexity.

Keywords: Complexity theory; Signal processing; Directed graphs; Wireless networks; Signal processing algorithms; Costs; approximation algorithms; Markov decision process (MDP); value function approximation; wireless networks; multiple-access; link analysis; graph signal processing; graph symmetrization
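
The abstract above approximates a value function by projecting it onto a lower-dimensional subspace spanned by chosen basis vectors. The sketch below shows only the projection step, as an ordinary least-squares (orthogonal) projection computed via Gram-Schmidt. The four-state value vectors and the constant-plus-linear basis are made-up illustrations; the paper's graph-signal-processing basis selection and projected equation method are not reproduced.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(basis):
    """Modified Gram-Schmidt; drops vectors that are (numerically) dependent."""
    ortho = []
    for b in basis:
        r = list(b)
        for q in ortho:
            coeff = dot(q, r)
            r = [ri - coeff * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > 1e-12:
            ortho.append([ri / norm for ri in r])
    return ortho


def project(v, basis):
    """Least-squares projection of the value vector v onto span(basis)."""
    approx = [0.0] * len(v)
    for q in orthonormalize(basis):
        coeff = dot(q, v)
        approx = [a + coeff * qi for a, qi in zip(approx, q)]
    return approx


# Hypothetical basis over 4 states: a constant vector and a linear ramp.
basis = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 2.0, 3.0]]

v_in_span = [1.0, 2.0, 3.0, 4.0]    # lies in the subspace, recovered exactly
v_off_span = [1.0, 0.0, 0.0, 1.0]   # does not, so only its mean survives

approx_in = project(v_in_span, basis)
approx_off = project(v_off_span, basis)
```

Two coefficients stand in for four state values here; how well this works on a real transition graph depends entirely on the choice of basis, which is the part the paper addresses.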

Incremental Sparse Bayesian Method for Online Dialog Strategy Learning

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2012, Vol. 6, No. 8, pp. 903-916

Authors: Lee, Sungjin; Eskenazi, Maxine. Affiliation: Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213, USA

This paper proposes an incremental sparse Bayesian learning method to allow continuous dialog strategy learning from the interactions with real users. Since conventional reinforcement learning (RL) methods require a huge number of dialogs to reach convergence, it has been essential to use a simulated user in training dialog policies. The disadvantage of this approach is that the trained dialog policies always lag behind the optimal one for live users. In order to tackle this problem, a few studies applying online RL methods to dialog management have emerged and shown very promising results. However, these methods are limited to learning online the weight parameters of the basis functions in the model and so need batch learning on a fixed data set or some heuristics to find appropriate values for other meta parameters such as sparsity-controlling thresholds, basis function parameters, and noise parameters. The proposed method attempts to overcome this limitation to achieve fully incremental and fast dialog strategy learning by adopting a sparse Bayesian learning method for value function approximation. In order to verify the proposed method, three different experimental conditions have been used: artificial data, a simulated user, and real users. The experiment on the artificial data showed that the proposed method successfully learns all the parameters in an incremental manner. Also, the experiment on training and evaluating dialog policies with a simulated user clearly demonstrated that the proposed method is much faster than conventional RL methods. A live user study showed that the dialog strategy learned from real users performed as well as the best past systems, although it slightly underperformed the one trained on simulated dialogs due to the difficulty of user feedback elicitation.

Keywords: Incremental learning; reinforcement learning; sparse Bayesian modeling; statistical dialog modeling; value function approximation

The Control of Invasive Species on Private Property with Neighbor-to-Neighbor Spillovers

ENVIRONMENTAL & RESOURCE ECONOMICS, 2014, Vol. 59, No. 2, pp. 231-255

Authors: Fenichel, Eli P.; Richards, Timothy J.; Shanafelt, David W. Affiliations: Yale Univ, Sch Forestry & Environm Studies, New Haven, CT 06511, USA; Arizona State Univ, Morrison Sch Agribusiness & Resource Management, Mesa, AZ 85212, USA; Arizona State Univ, Sch Life Sci, Tempe, AZ 85287, USA

Invasive pests cross property boundaries. Property managers may have private incentives to control invasive species despite not having sufficient incentive to fully internalize the external costs of their role in spreading the invasion. Each property manager has a right to future use of his own property, but his property may abut others' properties enabling spread of an invasive species. The incentives for a foresighted property manager to control invasive species have received little attention. We consider the efforts of a foresighted property manager who has rights to future use of a property and has the ability to engage in repeated, discrete control activities. We find that higher rates of dispersal, associated with proximity to neighboring properties, reduce the private incentives for control. Controlling species at one location provides incentives to control at a neighboring location. Control at neighboring locations are strategic complements and coupled with spatial heterogeneity lead to a weaker-link public good problem, in which each property owner is unable to fully appropriate the benefits of his own control activity. Future-use rights and private costs suggest that there is scope for a series of Coase-like exchanges to internalize much of the costs associated with species invasion. Pigouvian taxes on invasive species potentially have qualitatively perverse behavioral effects. A tax with a strong income effect (e.g., failure of effective revenue recycling) can reduce the value of property assets and diminish the incentive to manage insects on one's own property.

Keywords: Asian citrus psyllid; Bioeconomics; Citrus; Dynamic programming; Invasive species; Property rights; Repeat optimal stopping; Spatial externalities; value function approximation
