In traditional adaptive dynamic programming (ADP), only a one-step estimate is used in the training process, so learning efficiency is low. Including multi-step estimates speeds up the learning process. Eligibility traces record the past and current gradients of the estimate, and can be combined with ADP to accelerate learning. In this paper, heuristic dynamic programming (HDP), a typical ADP structure, is considered. An algorithm, HDP(lambda), integrating HDP with eligibility traces is presented. The algorithm is illustrated from both the forward view and the backward view for clear comprehension, and the equivalence of the two views is analyzed. Furthermore, the differences between HDP and HDP(lambda) are examined through both theoretical analysis and simulation. The problem of balancing a pendulum robot (pendubot) is adopted as a benchmark. The results indicate that, compared to HDP, HDP(lambda) achieves a higher convergence rate and greater training efficiency.
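The eligibility-trace mechanism the abstract refers to can be sketched in tabular TD(lambda) form; the state space, parameter values, and function name below are illustrative assumptions, not the paper's HDP(lambda) implementation:

```python
import numpy as np

def td_lambda_episode(transitions, n_states, alpha=0.1, gamma=0.95, lam=0.8):
    """Illustrative tabular TD(lambda) value update with accumulating traces.
    transitions: list of (state, reward, next_state) tuples from one episode."""
    V = np.zeros(n_states)
    e = np.zeros(n_states)                      # eligibility trace per state
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s]    # one-step TD error
        e[s] += 1.0                             # accumulate trace for visited state
        V += alpha * delta * e                  # credit all recently visited states
        e *= gamma * lam                        # traces decay toward zero
    return V

# A tiny 3-state chain 0 -> 1 -> 2, with reward 1 on reaching state 2.
episode = [(0, 0.0, 1), (1, 1.0, 2)]
V = td_lambda_episode(episode, n_states=3)
```

Because the trace for state 0 is still nonzero when the reward arrives, the single TD error updates both visited states at once, which is the multi-step credit assignment the abstract contrasts with one-step ADP.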
Today the massive amount of information available on the WWW often makes searching for information of interest a long and tedious task. Chasing hyperlinks to find relevant information can be daunting. To overcome this problem, a learning system, cognizant of a user's interests, can be employed to automatically search for and retrieve relevant information by following appropriate hyperlinks. In this paper, we describe the design of such a learning system for automated Web navigation using adaptive dynamic programming methods. To improve the performance of the learning system, we introduce the notion of multiple model-based learning agents operating in parallel, and describe methods for combining their models. Experimental results on the WWW navigation problem indicate that combining multiple learning agents, relying on user feedback, is a promising direction for improving learning speed in automated WWW navigation.
In a companion paper (Godfrey and Powell 2002) we introduced an adaptive dynamic programming algorithm for stochastic dynamic resource allocation problems, which arise in the context of logistics and distribution, fleet management, and other allocation problems. The method depends on estimating separable nonlinear approximations of value functions within a dynamic programming framework. That paper considered only the case in which the time to complete an action was always a single time period. Experiments with this technique quickly showed that when the basic algorithm was applied to problems with multiperiod travel times, the results were very poor. In this paper, we explain why this behavior arose and propose a modified algorithm that addresses the issue. Experimental work demonstrates that the modified algorithm works on problems with multiperiod travel times, with results that are almost as good as the original algorithm applied to single-period travel times.
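The separable value-function approximation described above can be sketched as one piecewise-linear concave function per resource type, stored as a vector of nonincreasing marginal values; the update and concavity-restoration rules below are simplified assumptions for illustration, not the authors' exact method:

```python
import numpy as np

def value_of(slopes, r):
    """Approximate value of holding r units of one resource type:
    the sum of the first r (nonincreasing) marginal values."""
    return float(np.sum(slopes[:r]))

def update_slope(slopes, r, observed_marginal, step=0.5):
    """Smooth the r-th marginal value toward a newly observed one,
    then crudely restore concavity by clipping each slope to be
    no larger than its predecessor."""
    slopes = slopes.astype(float).copy()
    slopes[r] = (1 - step) * slopes[r] + step * observed_marginal
    for i in range(1, len(slopes)):
        slopes[i] = min(slopes[i], slopes[i - 1])  # keep slopes nonincreasing
    return slopes

# Marginal values of the 1st, 2nd, and 3rd unit of one resource type.
slopes = np.array([3.0, 2.0, 1.0])
slopes = update_slope(slopes, r=1, observed_marginal=0.0)
```

Keeping the approximation separable (one such vector per resource type) is what makes the value function cheap to update after every sample, at the cost of ignoring interactions between resource types.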
ISBN:
(Print) 3540673547
Dynamic programming offers an exact, general solution method for completely known sequential decision problems, formulated as Markov Decision Processes (MDPs), with a finite number of states. Recently, there has been a great amount of interest in the adaptive version of the problem, where the task to be solved is not completely known a priori. In such a case, an agent has to acquire the necessary knowledge through learning, while simultaneously solving the optimal control or decision problem. A large variety of algorithms, variously known as adaptive dynamic programming (ADP) or reinforcement learning (RL), has been proposed in the literature. However, almost invariably such algorithms suffer from slow convergence in terms of the number of experiments needed. In this paper we investigate how the learning speed can be considerably improved by exploiting and combining knowledge accumulated by multiple agents. These agents operate in the same task environment but follow possibly different trajectories. We discuss methods of combining the knowledge structures associated with the multiple agents and different strategies (with varying overheads) for knowledge communication between agents. Results of simulation experiments are also presented to indicate that combining multiple learning agents is a promising direction for improving learning speed. The method also performs significantly better than some of the fastest MDP learning algorithms, such as prioritized sweeping.
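One simple way to combine the knowledge structures of multiple model-based agents, as discussed above, is to pool their experience counts into a shared estimate of the MDP's transition and reward model; the pooling rule below (summing visit counts) is an assumed scheme for illustration, not necessarily the one used in the paper:

```python
import numpy as np

def merge_models(count_list, reward_sum_list):
    """Pool experience from several agents into one shared model.
    count_list[i][s, a, s'] : agent i's observed transition counts
    reward_sum_list[i][s, a]: agent i's summed rewards for (s, a)
    Returns maximum-likelihood transition probabilities P and mean rewards R."""
    counts = sum(count_list)
    reward_sums = sum(reward_sum_list)
    sa_visits = counts.sum(axis=2)            # total visits to each (s, a) pair
    safe = np.maximum(sa_visits, 1)           # avoid division by zero
    P = counts / safe[:, :, None]
    R = reward_sums / safe
    return P, R

# Two agents exploring a toy 2-state, 1-action MDP along different trajectories.
c1 = np.zeros((2, 1, 2)); c1[0, 0, 1] = 2.0   # agent 1: saw 0 -> 1 twice
c2 = np.zeros((2, 1, 2)); c2[0, 0, 1] = 2.0   # agent 2: saw 0 -> 1 twice
r1 = np.zeros((2, 1)); r1[0, 0] = 2.0         # agent 1: reward 1.0 per visit
r2 = np.zeros((2, 1)); r2[0, 0] = 4.0         # agent 2: reward 2.0 per visit
P, R = merge_models([c1, c2], [r1, r2])
```

Because counts are sufficient statistics for the maximum-likelihood model, summing them gives the same estimate a single agent would have obtained from all four transitions, which is the intuition behind pooling experience to cut the number of experiments needed.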