Model-based reinforcement learning (MBRL) methods hold great promise for achieving excellent sample efficiency by fitting a dynamics model to previously observed data and leveraging it for RL or planning. However, the resulting trajectories may diverge from real-world trajectories due to the accumulation of errors in multi-step model sampling, particularly over longer horizons. This undermines the performance of MBRL and significantly affects sample efficiency. Therefore, we present a trajectory alignment technique capable of aligning simulated trajectories with their real counterparts from any initial random state and with adaptive length, enabling the preparation of paired real-simulated samples to minimize compounding errors. Additionally, we design a Q-function to estimate Q-values for the paired real-simulated samples. Simulated samples whose Q-value difference from their real counterparts surpasses a given threshold are discarded, thus preventing the model from overfitting to erroneous samples. Experimental results demonstrate that both trajectory alignment and Q-function guided sample filtration contribute to improving the policy and sample efficiency. Our method surpasses previous state-of-the-art model-based approaches in both sample efficiency and asymptotic performance across a series of challenging control tasks. The code is open source and available at https://***/duxin0618/***.
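To make the sample-filtration step concrete, below is a minimal sketch of how paired real-simulated transitions could be screened by their Q-value difference, as described in the abstract. The function name `filter_by_q_difference`, the batch layout, and the `q_fn` and `threshold` arguments are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def filter_by_q_difference(real_batch, sim_batch, q_fn, threshold):
    """Keep only simulated transitions whose Q-value stays within `threshold`
    of the Q-value of the paired real transition.

    real_batch / sim_batch: dicts with "states" and "actions" arrays,
    paired index-by-index by the trajectory alignment step (assumed layout).
    q_fn: callable (states, actions) -> per-sample Q-value estimates.
    """
    real_q = q_fn(real_batch["states"], real_batch["actions"])
    sim_q = q_fn(sim_batch["states"], sim_batch["actions"])

    # Discard simulated samples whose Q-value deviates too much from the
    # real counterpart, so the model is not trained on erroneous rollouts.
    keep = np.abs(real_q - sim_q) <= threshold
    return {key: value[keep] for key, value in sim_batch.items()}
```

In this sketch, the retained simulated samples would then be added to the model-training or policy-optimization buffer, while the rejected ones are simply dropped; the choice of threshold trades off data quantity against the risk of compounding model error.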