
Approximate Modified Policy Iteration and its Application to the Game of Tetris

Authors: Scherrer, Bruno; Ghavamzadeh, Mohammad; Gabillon, Victor; Lesner, Boris; Geist, Matthieu

Author Affiliations: INRIA Nancy Grand Est, Team Maia, 615 Rue Jardin Bot, F-54600 Vandoeuvre-lès-Nancy, France; Adobe Res, San Jose, CA 95110, USA; INRIA Lille, San Jose, CA 95110, USA; INRIA Lille Nord Europe, Team SequeL, F-59650 Villeneuve d'Ascq, France; IMS MaLIS Res Grp, Cent Supelec, F-57070 Metz, France; GeorgiaTech CNRS, UMI 2958, F-57070 Metz, France

Publication: JOURNAL OF MACHINE LEARNING RESEARCH

Year/Volume/Issue: 2015, Vol. 16, Issue 1

Pages: 1629-1676

Subject Classification: 08 [Engineering]; 0811 [Engineering - Control Science and Engineering]; 0812 [Engineering - Computer Science and Technology (degrees may be awarded in engineering or science)]

Keywords: approximate dynamic programming; reinforcement learning; Markov decision processes; finite-sample analysis; performance bounds; game of Tetris

Abstract: Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximate form, which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of the well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide an error propagation analysis that unifies those for approximate policy and value iteration. We develop the finite-sample analysis of these algorithms, which highlights the influence of their parameters. For the classification-based version of the algorithm (CBMPI), the analysis shows that MPI's main parameter controls the balance between the estimation error of the classifier and the overall value function approximation. We illustrate and evaluate the behavior of these new algorithms on the Mountain Car and Tetris problems. Remarkably, in Tetris, CBMPI outperforms the existing DP approaches by a large margin and competes with the current state-of-the-art methods while using fewer samples.
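
The main parameter the abstract refers to is MPI's interpolation parameter m: each iteration computes a policy greedy with respect to the current value estimate and then applies that policy's Bellman operator m times, so m = 1 recovers value iteration and m -> infinity recovers policy iteration. Below is a minimal tabular sketch of this exact scheme (the scheme that AMPI approximates); the toy 2-state MDP, its numbers, and all variable names are illustrative assumptions, not material from the paper.

import numpy as np

n_states, n_actions, gamma, m = 2, 2, 0.9, 5  # m = 1: value iteration; large m: policy iteration

# Made-up toy MDP: P[a, s, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.8, 0.2],
               [0.3, 0.7]],
              [[0.1, 0.9],
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

v = np.zeros(n_states)
for k in range(100):
    # Greedy step: pi_{k+1} is greedy with respect to v_k.
    q = R + gamma * np.einsum('ast,t->sa', P, v)   # q[s, a]
    pi = q.argmax(axis=1)
    # Partial evaluation step: v_{k+1} = (T_pi)^m v_k, i.e. apply the
    # Bellman operator of pi_{k+1} exactly m times.
    P_pi = P[pi, np.arange(n_states)]   # transition matrix under pi
    r_pi = R[np.arange(n_states), pi]   # reward vector under pi
    for _ in range(m):
        v = r_pi + gamma * P_pi @ v

print("greedy policy:", pi, "value estimate:", v)

In the approximate setting studied in the paper, the greedy and evaluation steps above are replaced by classifiers and regressors trained on samples, which is where the estimation/approximation trade-off controlled by m arises.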
