Author affiliation: ITA Software, Cambridge, MA 02139, USA
Publication: Machine Learning
Year/Volume/Issue: 2002, Vol. 49, No. 2-3
Pages: 233-246
Subject classification: 08 [Engineering]; 0812 [Engineering - Computer Science and Technology (degrees may be awarded in engineering or science)]
Funding: National Aeronautics and Space Administration (NASA); Carnegie Mellon University (CMU)
Keywords: reinforcement learning; temporal difference learning; value function approximation; linear least-squares methods
Abstract: TD(lambda) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(lambda) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and lambda = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning, 22(1-3), 33-57) eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from lambda = 0 to arbitrary values of lambda; at the extreme of lambda = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
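The abstract describes LSTD(lambda) as replacing stepsize-tuned incremental updates with accumulated sufficient statistics that are solved as a linear system. Below is a minimal sketch of that idea in Python, assuming a user-supplied feature map `phi` and episodes of observed transitions; the function name `lstd_lambda`, its arguments, and the small ridge term are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def lstd_lambda(episodes, phi, n_features, gamma=1.0, lam=0.0, eps=1e-3):
    """Sketch of LSTD(lambda) for linear policy evaluation.

    episodes: iterable of episodes, each a list of (s, r, s_next, done) tuples
              generated by the fixed policy being evaluated
    phi:      feature map, s -> np.ndarray of shape (n_features,)
    Returns theta such that V(s) is approximated by phi(s) @ theta.
    """
    A = eps * np.eye(n_features)   # small regularizer keeps A invertible (assumption)
    b = np.zeros(n_features)
    for episode in episodes:
        z = np.zeros(n_features)   # eligibility trace, reset each episode
        for (s, r, s_next, done) in episode:
            f = phi(s)
            f_next = np.zeros(n_features) if done else phi(s_next)
            z = gamma * lam * z + f                 # decay trace, add current features
            A += np.outer(z, f - gamma * f_next)    # accumulate statistics
            b += z * r
    # One linear solve replaces the stepsize-driven TD(lambda) iteration.
    return np.linalg.solve(A, b)
```

With lam = 0 this reduces to the LSTD algorithm of Bradtke and Barto; with lam = 1 the accumulated statistics correspond to a regression against observed returns, matching the paper's characterization of that extreme.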