Search results - Inner Mongolia University Library

Average cost temporal-difference learning

AUTOMATICA, 1999, Vol. 35, No. 11, pp. 1799-1808

Authors: Tsitsiklis, JN; Van Roy, B. Affiliation: MIT, Informat & Decis Syst Lab, Cambridge, MA 02139, USA

We propose a variant of temporal-difference learning that approximates average and differential costs of an irreducible aperiodic Markov chain. Approximations are comprised of linear combinations of fixed basis functions whose weights are incrementally updated during a single endless trajectory of the Markov chain. We present a proof of convergence (with probability 1) and a characterization of the limit of convergence. We also provide a bound on the resulting approximation error that exhibits an interesting dependence on the "mixing time" of the Markov chain. The results parallel previous work by the authors, involving approximations of discounted cost-to-go. (C) 1999 Elsevier Science Ltd. All rights reserved.

Keywords: dynamic programming; learning; average cost; reinforcement learning; neuro-dynamic programming; approximation; temporal differences
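The update described in the abstract above can be illustrated with a minimal sketch. It assumes the simplest variant (no eligibility traces, i.e. TD(0)), a hypothetical two-state chain, a single indicator basis function, and step-size schedules chosen only for this toy example; none of these specifics come from the paper.

```python
import random

def average_cost_td0(P, g, phi, steps, seed=0):
    """Average-cost TD(0) with linear features along one trajectory (sketch)."""
    rng = random.Random(seed)
    states = range(len(P))
    r = [0.0] * len(phi[0])   # weights of the fixed basis functions
    mu = 0.0                  # running estimate of the average cost
    x = 0
    for t in range(steps):
        x_next = rng.choices(states, weights=P[x])[0]
        h_x = sum(w * f for w, f in zip(r, phi[x]))
        h_next = sum(w * f for w, f in zip(r, phi[x_next]))
        # Temporal difference for the differential cost.
        d = g[x] - mu + h_next - h_x
        gamma = 30.0 / (t + 300.0)      # assumed step-size schedule
        r = [w + gamma * d * f for w, f in zip(r, phi[x])]
        mu += (g[x] - mu) / (t + 1.0)   # sample average of observed costs
        x = x_next
    return mu, r

# Hypothetical two-state chain with per-state costs and one indicator feature.
P = [[0.9, 0.1], [0.2, 0.8]]
g = [1.0, 4.0]
phi = [[0.0], [1.0]]
mu_hat, r_hat = average_cost_td0(P, g, phi, steps=200_000)
print(mu_hat, r_hat)
```

For this toy chain the stationary distribution is (2/3, 1/3), so the true average cost is 2 and the differential costs of the two states differ by 10; the estimates should land near those values, up to sampling noise.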


Stochastic approximation for nonexpansive maps: Application to Q-learning algorithms

SIAM JOURNAL ON CONTROL AND OPTIMIZATION, 2002, Vol. 41, No. 1, pp. 1-22

Authors: Abounadi, J; Bertsekas, DP; Borkar, V. Affiliations: MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139, USA; Tata Inst Fundamental Res, Sch Technol & Comp Sci, Bombay 400005, Maharashtra, India

We discuss synchronous and asynchronous iterations of the form x(k+1) = x(k) + gamma(k)(h(x(k)) + w(k)), where h is a suitable map and {w(k)} is a deterministic or stochastic sequence satisfying suitable conditions. In particular, in the stochastic case, these are stochastic approximation iterations that can be analyzed using the ODE approach based either on Kushner and Clark's lemma for the synchronous case or on Borkar's theorem for the asynchronous case. However, the analysis requires that the iterates {x(k)} be bounded, a fact which is usually hard to prove. We develop a novel framework for proving boundedness in the deterministic framework, which is also applicable to the stochastic case when the deterministic hypotheses can be verified in the almost sure sense. This is based on scaling ideas and on the properties of Lyapunov functions. We then combine the boundedness property with Borkar's stability analysis of ODEs involving nonexpansive mappings to prove convergence (with probability 1 in the stochastic case). We also apply our convergence analysis to Q-learning algorithms for stochastic shortest path problems and are able to relax some of the assumptions of the currently available results.

Keywords: stochastic approximation; Q-learning; neuro-dynamic programming
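The iteration x(k+1) = x(k) + gamma(k)(h(x(k)) + w(k)) from the abstract above can be tried on a scalar toy problem. The sketch below assumes h(x) = T(x) - x for a hypothetical map T(x) = 0.5x + 1 (a contraction, which is a special case of a nonexpansive map), Gaussian noise, and step sizes gamma(k) = 2/(k+2); these choices are illustrative and are not taken from the paper.

```python
import random

def stochastic_approximation(h, x0, steps, noise_std=1.0, seed=0):
    """Run x(k+1) = x(k) + gamma(k) * (h(x(k)) + w(k)) with Gaussian w(k)."""
    rng = random.Random(seed)
    x = x0
    for k in range(steps):
        gamma = 2.0 / (k + 2.0)          # assumed diminishing step size
        w = rng.gauss(0.0, noise_std)    # zero-mean noise term
        x = x + gamma * (h(x) + w)
    return x

def T(x):
    # Hypothetical map with fixed point x = 2.
    return 0.5 * x + 1.0

x_final = stochastic_approximation(lambda x: T(x) - x, x0=10.0, steps=100_000)
print(x_final)
```

With these step sizes the iterate is a weighted average of noisy samples, so it should settle near the fixed point 2 of T despite the noise.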


Analysis and optimization of service availability in an HA cluster with load-dependent machine availability

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2007, Vol. 18, No. 9, pp. 1307-1319

Authors: Ang, Chee-Wei; Tham, Chen-Khong. Affiliations: Inst Infocomm Res, Singapore 119613, Singapore; Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119260, Singapore

Calculations of service availability of a High-Availability (HA) cluster are usually based on the assumption of load-independent machine availabilities. In this paper, we study the issues and show how the service availabilities can be calculated under the assumption that machine availabilities are load dependent. We present a Markov chain analysis to derive the steady-state service availabilities of a load-dependent machine availability HA cluster. We show that with a load-dependent machine availability, the attained service availability is now policy dependent. After formulating the problem as a Markov Decision Process, we proceed to determine the optimal policy to achieve the maximum service availabilities by using the method of policy iteration. Two greedy assignment algorithms are studied: least load and first derivative length (FDL) based, where least load corresponds to some load balancing algorithms. We carry out the analysis and simulations on two cases of load profiles: in the first profile, a single machine has the capacity to host all services in the HA cluster; in the second profile, a single machine does not have enough capacity to host all services. We show that the service availabilities achieved under the first load profile are the same, whereas the service availabilities achieved under the second load profile are different. Since the service availabilities achieved are different in the second load profile, we proceed to investigate how the distribution of service availabilities across the services can be controlled by adjusting the rewards vector.

Keywords: high availability; cluster computing; Markov chains; Markov decision processes; dynamic programming; neuro-dynamic programming
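As a rough illustration of the steady-state Markov chain analysis mentioned in the abstract above, the sketch below computes the long-run availability of a single machine modeled as a two-state (up/down) chain. The per-step failure and repair probabilities are hypothetical, and the model ignores the load dependence and service assignment policies that the paper studies.

```python
def stationary_distribution(P, iters=2000):
    """Approximate the stationary distribution of P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        # One step of pi <- pi * P.
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Hypothetical per-step probabilities: state 0 = up, state 1 = down.
fail, repair = 0.01, 0.10
P = [[1 - fail, fail],
     [repair, 1 - repair]]

pi = stationary_distribution(P)
availability = pi[0]
print(availability)
```

For a two-state chain the result can be checked against the closed form repair / (fail + repair).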


Markov decision processes with delays and asynchronous cost collection

IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2003, Vol. 48, No. 4, pp. 568-574

Authors: Katsikopoulos, KV; Engelbrecht, SE. Affiliations: Univ Massachusetts, Dept Mech & Ind Engn, Amherst, MA 01003, USA; Univ Massachusetts, Dept Comp Sci, Amherst, MA 01003, USA

Markov decision processes (MDPs) may involve three types of delays. First, state information, rather than being available instantaneously, may arrive with a delay (observation delay). Second, an action may take effect at a later decision stage rather than immediately (action delay). Third, the cost induced by an action may be collected after a number of stages (cost delay). We derive two results, one for constant and one for random delays, for reducing an MDP with delays to an MDP without delays, which differs only in the size of the state space. The results are based on the intuition that costs may be collected asynchronously, i.e., at a stage other than the one in which they are induced, as long as they are discounted properly.

Keywords: asynchrony; delays; Markov decision processes (MDPs); neuro-dynamic programming


New rollout algorithms for combinatorial optimization problems

OPTIMIZATION METHODS & SOFTWARE, 2002, Vol. 17, No. 4, pp. 627-654

Authors: Guerriero, F; Mancini, M; Musmanno, R. Affiliation: Univ Calabria, Dipartimento Elettron Informat & Sistemist, I-87030 Arcavacata Di Rende, CS, Italy

Rollout algorithms are new computational approaches used to determine near-optimal solutions for deterministic and stochastic combinatorial optimization problems. They are built on a generic base heuristic with the aim to construct another hopefully improved heuristic. However, rollout algorithms can be very expensive from the computational point of view, so their use for practical applications can be limited. In this article, we propose modified versions of the rollout algorithms to solve deterministic optimization problems, defined in such a way to limit the computational cost, without worsening the quality of the final approximate solution obtained.

Keywords: combinatorial optimization problems; rollout algorithms; neuro-dynamic programming; local search methods; construction heuristics
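The basic rollout idea described in the abstract above (build an improved heuristic on top of a base heuristic) can be sketched on a small traveling salesman instance. This is the plain rollout scheme, not the modified, cheaper versions the article proposes; the nearest-neighbor base heuristic and the city coordinates are assumptions made for illustration.

```python
import math

def tour_length(tour, pts):
    """Length of the closed tour visiting pts in the given order."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def nearest_neighbor_completion(partial, pts):
    """Base heuristic: extend a partial tour by always going to the nearest city."""
    tour = list(partial)
    remaining = [c for c in range(len(pts)) if c not in tour]
    while remaining:
        last = tour[-1]
        nxt = min(remaining, key=lambda c: math.dist(pts[last], pts[c]))
        tour.append(nxt)
        remaining.remove(nxt)
    return tour

def rollout(pts, start=0):
    """At each step, try every next city and score it by the base heuristic."""
    partial = [start]
    while len(partial) < len(pts):
        candidates = [c for c in range(len(pts)) if c not in partial]
        best = min(candidates, key=lambda c: tour_length(
            nearest_neighbor_completion(partial + [c], pts), pts))
        partial.append(best)
    return partial

# Hypothetical city coordinates.
pts = [(0, 0), (1, 5), (5, 2), (6, 6), (8, 3), (2, 1), (7, 0), (3, 4)]
base_tour = nearest_neighbor_completion([0], pts)
rollout_tour = rollout(pts)
print(tour_length(base_tour, pts), tour_length(rollout_tour, pts))
```

Each rollout step calls the base heuristic once per remaining city, which is the computational cost the abstract refers to. Because nearest neighbor with fixed tie-breaking always continues a tour the same way, the rollout tour is never longer than the base tour.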


ADP-based optimal sensor scheduling for target tracking in energy harvesting wireless sensor networks

NEURAL COMPUTING & APPLICATIONS, 2016, Vol. 27, No. 6, pp. 1543-1551

Authors: Song, Ruizhuo; Wei, Qinglai; Xiao, Wendong. Affiliations: Univ Sci & Technol Beijing, Sch Automat & Elect Engn, Beijing 100083, Peoples R China; Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China

This paper proposes a novel sensor scheduling scheme based on adaptive dynamic programming, which makes the sensor energy consumption and tracking error optimal over the system operational horizon for wireless sensor networks with solar energy harvesting. Neural network is used to model the solar energy harvesting. Kalman filter estimation technology is employed to predict the target location. A performance index function is established based on the energy consumption and tracking error. Critic network is developed to approximate the performance index function. The presented method is proven to be convergent. Numerical example shows the effectiveness of the proposed approach.

Keywords: Adaptive critic designs; Adaptive dynamic programming; Approximate dynamic programming; neuro-dynamic programming; Neural networks; Wireless sensor networks; Scheduling


Valuation of American options via basis functions

IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2004, Vol. 49, No. 3, pp. 374-385

Authors: Lai, TL; Wong, SPS. Affiliations: Stanford Univ, Dept Stat, Stanford, CA 94305, USA; Hong Kong Univ Sci & Technol, Dept Informat & Syst Management, Hong Kong, Hong Kong, Peoples R China

After a brief review of recent developments in the pricing and hedging of American options, this paper modifies the basis function approach to adaptive control and neuro-dynamic programming, and applies it to develop: 1) nonparametric pricing formulas for actively traded American options and 2) simulation-based optimization strategies for complex over-the-counter options, whose optimal stopping problems are prohibitively difficult to solve numerically by standard backward induction algorithms because of the curse of dimensionality. An important issue in this approach is the choice of basis functions, for which some guidelines and their underlying theory are provided.

Keywords: function approximation; neuro-dynamic programming; optimal stopping; option pricing; spline basis


Discrete-Time Deterministic Q-Learning: A Novel Convergence Analysis

IEEE TRANSACTIONS ON CYBERNETICS, 2017, Vol. 47, No. 5, pp. 1224-1237

Authors: Wei, Qinglai; Lewis, Frank L.; Sun, Qiuye; Yan, Pengfei; Song, Ruizhuo. Affiliations: Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China; Univ Texas Arlington, UTA Res Inst, Arlington, TX 76118, USA; Northeastern Univ, Shenyang 110036, Peoples R China; Northeastern Univ, Sch Informat Sci & Engn, Shenyang 110036, Peoples R China; Univ Sci & Technol Beijing, Sch Automat & Elect Engn, Beijing 100083, Peoples R China

In this paper, a novel discrete-time deterministic Q-learning algorithm is developed. In each iteration of the developed Q-learning algorithm, the iterative Q function is updated for all the state and control spaces, instead of updating for a single state and a single control in traditional Q-learning algorithm. A new convergence criterion is established to guarantee that the iterative Q function converges to the optimum, where the convergence criterion of the learning rates for traditional Q-learning algorithms is simplified. During the convergence analysis, the upper and lower bounds of the iterative Q function are analyzed to obtain the convergence criterion, instead of analyzing the iterative Q function itself. For convenience of analysis, the convergence properties for undiscounted case of the deterministic Q-learning algorithm are first developed. Then, considering the discounted factor, the convergence criterion for the discounted case is established. Neural networks are used to approximate the iterative Q function and compute the iterative control law, respectively, for facilitating the implementation of the deterministic Q-learning algorithm. Finally, simulation results and comparisons are given to illustrate the performance of the developed algorithm.

Keywords: Adaptive critic designs; adaptive dynamic programming (ADP); approximate dynamic programming; neural networks (NNs); neuro-dynamic programming; optimal control; Q-learning
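The idea of updating the Q function for every state and control in each iteration, as summarized in the abstract above, can be sketched in tabular form. The example assumes a hypothetical five-state deterministic chain with unit stage cost, a discount factor of 0.9, and a constant learning rate; it does not reproduce the paper's convergence criterion or its neural-network implementation.

```python
def deterministic_q_learning(n_states, actions, step, cost,
                             discount=0.9, alpha=0.5, sweeps=500):
    """Update Q for every (state, control) pair in each sweep (tabular sketch)."""
    Q = {(x, u): 0.0 for x in range(n_states) for u in actions}
    for _ in range(sweeps):
        new_Q = {}
        for (x, u), q in Q.items():
            x_next = step(x, u)
            # Cost-minimizing target built from the previous iterate.
            target = cost(x, u) + discount * min(Q[(x_next, a)] for a in actions)
            new_Q[(x, u)] = (1 - alpha) * q + alpha * target
        Q = new_Q
    return Q

def step(x, u):
    # Hypothetical deterministic chain on states 0..4; state 0 is absorbing.
    if x == 0:
        return 0
    return max(0, min(4, x + u))

def cost(x, u):
    # Unit cost per stage until the goal state 0 is reached.
    return 0.0 if x == 0 else 1.0

actions = (-1, 1)
Q = deterministic_q_learning(5, actions, step, cost)
V = [min(Q[(x, u)] for u in actions) for x in range(5)]
print(V)
```

For this chain the optimal cost satisfies V(0) = 0 and V(s) = 1 + 0.9 V(s-1), which gives a closed form to compare the result against.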


Continuous-Time Time-Varying Policy Iteration

IEEE TRANSACTIONS ON CYBERNETICS, 2020, Vol. 50, No. 12, pp. 4958-4971

Authors: Wei, Qinglai; Liao, Zehua; Yang, Zhanyu; Li, Benkai; Liu, Derong. Affiliations: Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China; Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China; Guangdong Univ Technol, Sch Automat, Guangzhou 510006, Peoples R China

A novel policy iteration algorithm, called the continuous-time time-varying (CTTV) policy iteration algorithm, is presented in this paper to obtain the optimal control laws for infinite horizon CTTV nonlinear systems. The adaptive dynamic programming (ADP) technique is utilized to obtain the iterative control laws for the optimization of the performance index function. The properties of the CTTV policy iteration algorithm are analyzed. Monotonicity, convergence, and optimality of the iterative value function have been analyzed, and the iterative value function can be proven to monotonically converge to the optimal solution of the Hamilton-Jacobi-Bellman (HJB) equation. Furthermore, the iterative control law is guaranteed to be admissible to stabilize the nonlinear systems. In the implementation of the presented CTTV policy algorithm, the approximate iterative control laws and iterative value function are obtained by neural networks. Finally, the numerical results are given to verify the effectiveness of the presented method.

Keywords: Optimal control; Nonlinear systems; Time-varying systems; Mathematical model; dynamic programming; Approximation algorithms; Iterative algorithms; Adaptive critic designs; adaptive dynamic programming (ADP); neuro-dynamic programming; policy iteration


A partial policy iteration ADP algorithm for nonlinear neuro-optimal control with discounted total reward

NEUROCOMPUTING, 2021, Vol. 424, pp. 23-34

Authors: Liang, Mingming; Wei, Qinglai. Affiliations: Guangdong Univ Technol, Sch Automat, Guangzhou 510006, Peoples R China; Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China

This paper constructs a partial policy iteration adaptive dynamic programming (ADP) algorithm to solve the optimal control problem of nonlinear systems with discounted total reward. Compared with the traditional policy iteration ADP algorithm, the approach updates the iterative control law only in a local region of the global system state space. With the benefit of this feature, the overall computational burden at each iteration for processing units can be significantly reduced. Hence, this feature enables our algorithm to be successfully executed on low-performance devices such as smartphones, smartwatches and the Internet of Things (IoT) objects. We provide the convergence analysis to show that the generated sequence of value functions is monotonically nonincreasing and can finally reach a local optimum. In addition, the corresponding local policy space is developed theoretically for the first time. Besides, when the sequence of the local system state spaces is chosen properly, we prove that the developed algorithm is capable of finding the global optimal performance index function for the nonlinear systems. Finally, we present a numerical simulation to demonstrate the effectiveness of the proposed algorithm. (c) 2020 Elsevier B.V. All rights reserved.

Keywords: Adaptive critic designs; Adaptive dynamic programming; Policy iteration; Neural networks; neuro-dynamic programming; Nonlinear systems; Optimal control

