
Refine Search Results

Document Type

  • 61 journal articles
  • 21 conference papers

Collection Scope

  • 82 electronic documents
  • 0 print holdings

Date Distribution

Subject Classification

  • 74 Engineering
    • 47 Computer Science and Technology...
    • 37 Control Science and Engineering
    • 31 Electrical Engineering
    • 6 Software Engineering
    • 5 Mechanical Engineering
    • 3 Information and Communication Engineering
    • 2 Instrument Science and Technology
    • 2 Aerospace Science and Tech...
    • 1 Electronic Science and Technology (...
    • 1 Chemical Engineering and Technology
    • 1 Transportation Engineering
    • 1 Environmental Science and Engineering (...
  • 15 Science
    • 6 Mathematics
    • 6 Systems Science
    • 3 Physics
    • 2 Chemistry
    • 1 Biology
    • 1 Ecology
  • 10 Management
    • 10 Management Science and Engineering (...
    • 2 Business Administration
  • 2 Economics
    • 2 Applied Economics
  • 1 Law
    • 1 Law
  • 1 Education
    • 1 Education
  • 1 Military Science

Topics

  • 82 neuro-dynamic pr...
  • 28 optimal control
  • 24 reinforcement le...
  • 20 approximate dyna...
  • 19 adaptive critic ...
  • 18 neural networks
  • 15 adaptive dynamic...
  • 12 nonlinear system...
  • 11 dynamic programm...
  • 9 adaptive dynamic...
  • 6 function approxi...
  • 6 policy iteration
  • 5 scheduling
  • 4 markov chains
  • 4 generalized poli...
  • 3 value iteration
  • 3 temporal-differe...
  • 3 q-learning
  • 2 plug-in hybrid e...
  • 2 differential gam...

Institutions

  • 21 chinese acad sci...
  • 10 univ sci & techn...
  • 8 guangdong univ t...
  • 4 beijing normal u...
  • 3 alphatech inc bu...
  • 2 guangdong univ t...
  • 2 mit informat & d...
  • 2 georgia inst tec...
  • 2 school of automa...
  • 2 mit dept elect e...
  • 2 northeastern uni...
  • 2 univ texas arlin...
  • 2 southern univ sc...
  • 2 univ illinois de...
  • 2 changchun univ t...
  • 2 rzeszow univ tec...
  • 1 univ sci & techn...
  • 1 princeton univ d...
  • 1 univ chinese aca...
  • 1 chinese acad sci...

Authors

  • 19 liu derong
  • 18 wei qinglai
  • 7 song ruizhuo
  • 5 zhao bo
  • 5 wang ding
  • 3 bertsekas dp
  • 3 tsitsiklis jn
  • 3 jay h. lee
  • 3 yang xiong
  • 3 lee jh
  • 3 lee jm
  • 3 yan pengfei
  • 2 burghardt andrze...
  • 2 lewis frank l.
  • 2 li yuanchun
  • 2 an tianjiao
  • 2 niket s. kaisare
  • 2 vanroy b
  • 2 szuster marcin
  • 2 lin hanquan

Language

  • 74 English
  • 4 Other
  • 4 Chinese

Search criteria: Subject = "neuro-dynamic programming"
82 records in total; showing 51-60
Restricted gradient-descent algorithm for value-function approximation in reinforcement learning
ARTIFICIAL INTELLIGENCE, 2008, Vol. 172, No. 4-5, pp. 454-482
Authors: Salles Barreto, Andre da Motta; Anderson, Charles W.
Affiliations: Univ Fed Rio de Janeiro COPPE Programa Engn Civil BR-21945 Rio De Janeiro Brazil; Colorado State Univ Dept Comp Sci Ft Collins CO 80523 USA
This work presents the restricted gradient-descent (RGD) algorithm, a training method for local radial-basis function networks specifically developed to be used in the context of reinforcement learning. The RGD algori...
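The abstract above is truncated. Purely as orientation for the general setting it describes, namely gradient-descent training of a local radial-basis-function value approximator from sampled transitions, here is a minimal Python sketch of a plain semi-gradient TD(0) update on RBF features. It does not reproduce the paper's specific "restricted" update rules, and every name, constant, and the toy state space below are placeholder assumptions.

```python
import numpy as np

# Plain semi-gradient TD(0) on a local RBF network: V(s) ~ w . phi(s).
# Illustrates the general setting only; the RGD algorithm's specific
# restrictions on the gradient step are not reproduced here.

def rbf_features(state, centers, width):
    """Gaussian RBF activations for a (vector-valued) state."""
    diffs = centers - state                        # (n_centers, state_dim)
    return np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * width ** 2))

def td0_step(w, transition, centers, width, alpha=0.05, gamma=0.99):
    """One semi-gradient TD(0) update of the output weights w."""
    s, r, s_next, done = transition
    phi = rbf_features(s, centers, width)
    phi_next = rbf_features(s_next, centers, width)
    target = r + (0.0 if done else gamma * np.dot(w, phi_next))
    td_error = target - np.dot(w, phi)
    return w + alpha * td_error * phi              # descend the squared TD error

# Hypothetical usage on a 1-D state space with 11 evenly spaced centers.
centers = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
w = np.zeros(len(centers))
w = td0_step(w, (np.array([0.3]), 1.0, np.array([0.4]), False), centers, width=0.1)
```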
Relative value function approximation for the capacitated re-entrant line scheduling problem
IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2005, Vol. 2, No. 3, pp. 285-299
Authors: Choi, JY; Reveliotis, S
Affiliation: Georgia Inst Technol Sch Ind & Syst Engn Atlanta GA 30332 USA
The problem addressed in this study is that of determining how to allocate the workstation processing and buffering capacity in a capacitated re-entrant line to the job instances competing for it, in order to maximize...
Reinforcement-Learning-Based Robust Controller Design for Continuous-Time Uncertain Nonlinear Systems Subject to Input Constraints
IEEE TRANSACTIONS ON CYBERNETICS, 2015, Vol. 45, No. 7, pp. 1372-1385
Authors: Liu, Derong; Yang, Xiong; Wang, Ding; Wei, Qinglai
Affiliation: Chinese Acad Sci Inst Automat State Key Lab Management & Control Complex Syst Beijing 100190 Peoples R China
The design of a stabilizing controller for uncertain nonlinear systems with control constraints is a challenging problem. The input constraints, coupled with the inability to accurately identify the uncertainties, motivat...
Neuro-Optimal Control for Discrete Stochastic Processes via a Novel Policy Iteration Algorithm
IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2020, Vol. 50, No. 11, pp. 3972-3985
Authors: Liang, Mingming; Wang, Ding; Liu, Derong
Affiliations: Chinese Acad Sci Inst Automat State Key Lab Management & Control Complex Syst Beijing 100190 Peoples R China; Beijing Univ Technol Fac Informat Technol Beijing 100124 Peoples R China; Beijing Univ Technol Beijing Key Lab Computat Intelligence & Intellige Beijing 100124 Peoples R China; Guangdong Univ Technol Sch Automat Guangzhou 510006 Peoples R China
In this paper, a novel policy iteration adaptive dynamic programming (ADP) algorithm, called the "local policy iteration ADP algorithm," is presented to obtain the optimal control for discrete stochastic ...
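For orientation, the classical policy iteration scheme that the paper's local ADP variant builds on alternates exact policy evaluation with greedy policy improvement. The tabular sketch below is a generic textbook version under assumed finite state and action sets; the paper itself works with neural-network approximations of stochastic processes, which are not reproduced here.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """Classical policy iteration on a finite MDP.

    P: (A, S, S) transition probabilities, R: (S, A) expected rewards.
    Returns an optimal deterministic policy and its value function.
    """
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[policy, np.arange(n_states), :]
        r_pi = R[np.arange(n_states), policy]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to the evaluated value.
        q = R.T + gamma * P @ v                # (A, S) action values
        new_policy = np.argmax(q, axis=0)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy

# Hypothetical 2-state, 2-action MDP for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # transitions under action 0
              [[0.5, 0.5], [0.1, 0.9]]])      # transitions under action 1
R = np.array([[1.0, 0.0],                     # rewards R[s, a]
              [0.0, 2.0]])
pi, v = policy_iteration(P, R)
```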
Rollout algorithms for stochastic scheduling problems
JOURNAL OF HEURISTICS, 1999, Vol. 5, No. 1, pp. 89-108
Authors: Bertsekas, DP; Castañon, DA
Affiliations: MIT Dept Elect Engn & Comp Sci Cambridge MA 02139 USA; Boston Univ Dept Elect Engn Burlington MA 01803 USA; Alphatech Inc Burlington MA 01803 USA
Stochastic scheduling problems are difficult stochastic control problems with combinatorial decision spaces. In this paper we focus on a class of stochastic scheduling problems, the quiz problem and its variations. We...
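In the basic quiz problem referenced in the abstract, questions are attempted one at a time, question i is answered correctly with probability p[i] for reward v[i], and the quiz stops at the first miss. The sketch below illustrates only the rollout idea: each candidate first question is scored by the exact expected reward of completing the sequence with a simple index heuristic as the base policy. The index rule, the toy instance, and the use of exact expectations instead of simulation are assumptions for illustration; the stochastic variations studied in the paper are not reproduced.

```python
def expected_reward(order, p, v):
    """Expected total reward of attempting questions in 'order';
    v[i] is collected for each correct answer and the quiz stops
    at the first incorrect one."""
    total, survive = 0.0, 1.0
    for i in order:
        survive *= p[i]              # probability of reaching and passing question i
        total += survive * v[i]
    return total

def index_heuristic(remaining, p, v):
    """Base policy: attempt questions in decreasing p*v/(1-p) order."""
    return sorted(remaining, key=lambda i: p[i] * v[i] / (1.0 - p[i] + 1e-12), reverse=True)

def rollout_order(p, v):
    """One-step lookahead using the index heuristic as the rollout policy."""
    remaining, order = set(range(len(p))), []
    while remaining:
        best = max(
            remaining,
            key=lambda i: expected_reward([i] + index_heuristic(remaining - {i}, p, v), p, v),
        )
        order.append(best)
        remaining.remove(best)
    return order

# Hypothetical instance: success probabilities and rewards per question.
p, v = [0.9, 0.6, 0.8], [1.0, 5.0, 2.0]
order = rollout_order(p, v)
print(order, expected_reward(order, p, v))
```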
Asymptotic analysis of temporal-difference learning algorithms with constant step-sizes
MACHINE LEARNING, 2006, Vol. 63, No. 2, pp. 107-133
Author: Tadic, VB
Affiliation: Univ Sheffield Dept Automat Control & Syst Engn Sheffield S1 3JD S Yorkshire England
The mean-square asymptotic behavior of temporal-difference learning algorithms with constant step-sizes and linear function approximation is analyzed in this paper. The analysis is carried out for the case of discoun...
Online Synchronous Approximate Optimal Learning Algorithm for Multiplayer Nonzero-Sum Games With Unknown Dynamics
IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2014, Vol. 44, No. 8, pp. 1015-1027
Authors: Liu, Derong; Li, Hongliang; Wang, Ding
Affiliation: Chinese Acad Sci Inst Automat State Key Lab Management & Control Complex Syst Beijing 100190 Peoples R China
In this paper, we develop an online synchronous approximate optimal learning algorithm based on policy iteration to solve a multiplayer nonzero-sum game without the requirement of exact knowledge of dynamical systems....
A structure property of optimal policies for maintenance problems with safety-critical components
IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2008, Vol. 5, No. 3, pp. 519-531
Authors: Xia, Li; Zhao, Qianchuan; Jia, Qing-Shan
Affiliations: Tsinghua Univ Dept Automat Ctr Intelligent & Networked Syst CFINS Beijing 100084 Peoples R China; Tsinghua Univ TNLIST Beijing 100084 Peoples R China
The maintenance problem with safety-critical components is significant for the economic benefit of companies. Motivated by a practical asset maintenance project, a new joint replacement maintenance problem is introd...
A single front genetic algorithm for parallel multi-objective optimization in dynamic environments
NEUROCOMPUTING, 2009, Vol. 72, No. 16-18, pp. 3570-3579
Authors: Camara, Mario; Ortega, Julio; de Toro, Francisco
Affiliations: Univ Granada Dept Comp Technol & Architecture E-18071 Granada Spain; Univ Granada Dept Signal Theory Telemat & Commun E-18071 Granada Spain
This paper proposes a new parallel evolutionary procedure to solve multi-objective dynamic optimization problems, along with some measures to evaluate multi-objective optimization in dynamic environments. These dynamic...
An analysis of temporal-difference learning with function approximation
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 1997, Vol. 42, No. 5, pp. 674-690
Authors: Tsitsiklis, JN; VanRoy, B
Affiliation: Laboratory for Information and Decision Systems, Massachusetts Institute of Technology
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear functi...
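The rule analyzed in this paper is TD(lambda) with a linear approximation of the cost-to-go, driven by an eligibility trace. Below is a minimal sketch of that update; the five-state random-walk chain, one-hot features, and step-size values are hypothetical placeholders rather than anything taken from the paper.

```python
import numpy as np

def td_lambda_linear(sample_next, features, costs, theta0, n_steps,
                     alpha=0.01, gamma=0.95, lam=0.7):
    """TD(lambda) with linear function approximation V(x) ~ features(x) @ theta.

    sample_next(x): draws the next state of the Markov chain.
    costs(x): per-stage cost g(x) accumulated in the cost-to-go.
    """
    theta = np.array(theta0, dtype=float)
    z = np.zeros_like(theta)                     # eligibility trace
    x = 0                                        # arbitrary initial state
    for _ in range(n_steps):
        x_next = sample_next(x)
        phi, phi_next = features(x), features(x_next)
        delta = costs(x) + gamma * phi_next @ theta - phi @ theta   # temporal difference
        z = gamma * lam * z + phi
        theta += alpha * delta * z
        x = x_next
    return theta

# Hypothetical 5-state chain with random-walk dynamics and one-hot features.
rng = np.random.default_rng(0)
theta = td_lambda_linear(
    sample_next=lambda x: (x + rng.choice([-1, 1])) % 5,
    features=lambda x: np.eye(5)[x],
    costs=lambda x: float(x == 0),
    theta0=np.zeros(5),
    n_steps=10_000,
)
```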