A major issue in model-free reinforcement learning is how to efficiently exploit the data collected by an exploration strategy. This is especially important in the case of continuous, high-dimensional state spaces, since it is impossible to explore such spaces exhaustively. A simple but promising approach is to fix the number of state transitions which are sampled from the underlying Markov decision process. For several kernel-based learning algorithms there exist convergence proofs and notable empirical results if a fixed set of transition instances is used. In this article, we analyze how function approximators similar to the CMAC architecture can be combined with this idea. We show, both analytically and empirically, the potential power of the CMAC architecture combined with an offline version of Q-learning.
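As a rough illustration of the combination described above, the following sketch pairs a tile-coding (CMAC-style) approximator with offline Q-learning sweeps over a fixed batch of transitions. The state bounds, tile counts, learning rate, and transition format are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch: CMAC-style tile coding + offline Q-learning on a fixed batch.
import numpy as np

class CMAC:
    def __init__(self, n_tilings=8, tiles_per_dim=8,
                 low=(-1.0, -1.0), high=(1.0, 1.0), n_actions=3):
        self.n_tilings, self.tiles, self.n_actions = n_tilings, tiles_per_dim, n_actions
        self.low, self.high = np.array(low), np.array(high)
        # one weight table per tiling and per action (2-D state assumed)
        self.w = np.zeros((n_tilings, n_actions, tiles_per_dim, tiles_per_dim))
        # each tiling is shifted by a fraction of a tile width
        self.offsets = np.linspace(0.0, 1.0, n_tilings, endpoint=False)

    def _active_tiles(self, s):
        scaled = (np.asarray(s) - self.low) / (self.high - self.low) * (self.tiles - 1)
        for t, off in enumerate(self.offsets):
            idx = np.clip(np.floor(scaled + off).astype(int), 0, self.tiles - 1)
            yield t, tuple(idx)

    def q(self, s, a):
        return sum(self.w[t][a][idx] for t, idx in self._active_tiles(s))

    def update(self, s, a, target, alpha=0.1):
        err = target - self.q(s, a)
        for t, idx in self._active_tiles(s):
            self.w[t][a][idx] += alpha / self.n_tilings * err

def offline_q_learning(cmac, transitions, gamma=0.95, sweeps=50):
    """Repeatedly sweep a fixed set of (s, a, r, s_next, done) samples."""
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            best_next = 0.0 if done else max(cmac.q(s_next, b) for b in range(cmac.n_actions))
            cmac.update(s, a, r + gamma * best_next)
    return cmac
```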
We propose the use of kernel-based methods as the underlying function approximator in the least-squares based policy evaluation framework of LSPE(lambda) and LSTD(lambda). In particular, we present the 'kernelization' of model-free LSPE(lambda). The 'kernelization' is made computationally feasible by using the subset of regressors approximation, which approximates the kernel using a vastly reduced number of basis functions. The core of our proposed solution is an efficient recursive implementation with automatic supervised selection of the relevant basis functions. The LSPE method is well suited for optimistic policy iteration and can thus be used in the context of online reinforcement learning. We use the high-dimensional Octopus benchmark to demonstrate this.
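To convey the general flavor of least-squares policy evaluation with kernel features on a reduced set of basis centers, here is a minimal sketch; it implements plain batch LSTD(lambda) with Gaussian kernel features rather than the paper's recursive LSPE(lambda) with automatic basis selection, and the centers, bandwidth, and regularization are assumptions.

```python
# Batch LSTD(lambda) with Gaussian kernel features on a fixed dictionary of centers.
import numpy as np

def kernel_features(s, centers, bandwidth=0.5):
    d = np.linalg.norm(centers - np.asarray(s), axis=1)
    return np.exp(-(d / bandwidth) ** 2)

def lstd_lambda(transitions, centers, gamma=0.95, lam=0.7, reg=1e-3):
    """Evaluate one fixed policy from (s, r, s_next, done) transitions."""
    k = len(centers)
    A = reg * np.eye(k)          # regularization keeps A invertible
    b = np.zeros(k)
    z = np.zeros(k)              # eligibility trace over feature vectors
    for s, r, s_next, done in transitions:
        phi = kernel_features(s, centers)
        phi_next = kernel_features(s_next, centers)
        z = gamma * lam * z + phi
        A += np.outer(z, phi - (0.0 if done else gamma) * phi_next)
        b += z * r
        if done:
            z = np.zeros(k)
    return np.linalg.solve(A, b)  # value-function weights: V(s) ~ w . phi(s)
```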
A continuous-time formulation of an adaptive critic design (ACD) is investigated. Connections to the discrete case are made, where backpropagation through time (BPTT) and real-time recurrent learning (RTRL) are prevalent. Practical benefits are that this framework fits in well with plant descriptions given by differential equations and that any standard integration routine with adaptive step size does an adaptive sampling for free. A second-order actor adaptation using Newton's method is established for fast actor convergence for a general plant and critic. Also, a fast critic update for concurrent actor-critic training is introduced to immediately apply the adjustments of critic parameters induced by actor updates, keeping the Bellman optimality correct to a first-order approximation after actor changes. Thus, critic and actor updates may be performed at the same time until substantial error builds up in the Bellman optimality or temporal difference equation, at which point a traditional critic training phase needs to be performed, after which another interval of concurrent actor-critic training may resume.
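For orientation, a generic discounted continuous-time Bellman (Hamilton-Jacobi-Bellman) relation of the kind such a critic must keep approximately satisfied can be written as below; the paper's exact formulation, sign conventions, and discounting may differ.

$$\beta J(x) \;=\; \min_{u}\Big[\, r(x,u) \;+\; \nabla J(x)^{\top} f(x,u) \,\Big],$$

where $f$ denotes the plant dynamics given by differential equations, $J$ the critic's cost-to-go function, $r$ the instantaneous cost, and $\beta$ the discount rate.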
Learning automata are shown to be an excellent tool for creating learning multi-agent systems. Most algorithms used in current automata research expect the environment to end in an explicit end-stage. In this end-stage the rewards are given to the learning automata (i.e., Monte Carlo updating). This is, however, infeasible in sequential decision problems with an infinite horizon, where no such end-stage exists. In this paper we propose a new algorithm based on one-step returns that uses bootstrapping to find good equilibrium paths in multi-stage games.
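A hedged sketch of the general idea follows: a per-state learning automaton whose probability update is driven by a bootstrapped one-step return instead of a Monte Carlo end-of-episode reward. The particular update rule, feedback scaling, and value estimate are illustrative assumptions, not the authors' exact algorithm.

```python
# Per-state automaton updated from a bootstrapped one-step return.
import numpy as np

class BootstrappingAutomaton:
    def __init__(self, n_states, n_actions, alpha=0.05, gamma=0.95):
        self.p = np.full((n_states, n_actions), 1.0 / n_actions)  # action probabilities
        self.v = np.zeros(n_states)                               # state-value estimates
        self.alpha, self.gamma = alpha, gamma

    def act(self, s, rng):
        return rng.choice(len(self.p[s]), p=self.p[s])

    def update(self, s, a, r, s_next):
        # one-step bootstrapped return replaces the end-stage Monte Carlo reward
        g = r + self.gamma * self.v[s_next]
        self.v[s] += self.alpha * (g - self.v[s])
        beta = np.clip(g, 0.0, 1.0)              # feedback scaled into [0, 1]
        # linear reward-inaction style shift toward the chosen action
        self.p[s] *= (1.0 - self.alpha * beta)
        self.p[s, a] += self.alpha * beta
        self.p[s] /= self.p[s].sum()
```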
Viability theory considers the problem of maintaining a system under a set of viability constraints. The main tool for solving viability problems lies in the construction of the viability kernel, defined as the set of initial states from which there exists a trajectory that remains in the set of constraints indefinitely. The theory is very elegant and appears naturally in many applications. Unfortunately, the current numerical approaches suffer from low computational efficiency, which limits the potential range of applications of this domain. In this paper we show that the viability kernel is the zero-level set of a related dynamic programming problem, which opens promising research directions for numerical approximation of the viability kernel using tools from approximate dynamic programming. We illustrate the approach using k-nearest neighbors on a toy problem in two dimensions and on a complex dynamical model of an anaerobic digestion process in four dimensions.
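As one concrete reading of the zero-level-set claim, the sketch below runs a value-iteration-style recursion V(x) = max(g(x), min_u V(f(x, u))), with g(x) <= 0 inside the constraint set, and approximates V with k-nearest neighbors over sampled states. The recursion form, dynamics interface, and use of scikit-learn are illustrative assumptions, not the paper's exact construction.

```python
# Approximate DP whose zero-sublevel set estimates the viability kernel.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def viability_value_iteration(samples, controls, f, g, n_iters=50, k=5):
    """samples: 2-D array of sampled states; f(x, u): discrete-time dynamics;
    g(x): constraint function, <= 0 inside the constraint set."""
    values = np.array([g(x) for x in samples])
    for _ in range(n_iters):
        knn = KNeighborsRegressor(n_neighbors=k).fit(samples, values)
        new_values = np.empty_like(values)
        for i, x in enumerate(samples):
            successors = np.array([f(x, u) for u in controls])
            v_next = knn.predict(successors)            # V at reachable states
            new_values[i] = max(g(x), v_next.min())     # best control, worst of g and future
        values = new_values
    # approximate viability kernel: samples with value <= 0
    return values
```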
This paper presents the application of an approximate dynamic programming (ADP) algorithm to the problem of job releasing and sequencing of a benchmark reentrant manufacturing line (RML). The ADP approach is based on the SARSA(lambda) algorithm with linear approximation structures that are tuned through a gradient-descent approach. The optimization is performed according to a discounted cost criterion that seeks both the minimization of inventory costs and the maximization of throughput. Simulation experiments are performed using different approximation architectures to compare the performance of optimal strategies against policies obtained with ADP. Results from these experiments showed a statistical match in performance between the optimal policies and the approximated policies obtained through ADP. These results also suggest that the ADP algorithm presented in this paper may be a promising approach for larger RML systems.
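For readers unfamiliar with the underlying learner, a generic SARSA(lambda) routine with a linear approximation architecture tuned by gradient descent looks roughly as follows; the feature map, environment interface, and hyperparameters are assumptions, and in the paper's setting the reward would encode the negative inventory cost plus throughput.

```python
# Generic SARSA(lambda) with linear function approximation and eligibility traces.
import numpy as np

def sarsa_lambda(env, phi, n_features, n_actions, episodes=200,
                 alpha=0.01, gamma=0.98, lam=0.9, epsilon=0.1, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    w = np.zeros((n_actions, n_features))            # one weight vector per action

    def q(s, a):
        return w[a] @ phi(s)

    def policy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        z = np.zeros_like(w)                         # eligibility traces
        s, a, done = env.reset(), None, False
        a = policy(s)
        while not done:
            s_next, r, done = env.step(a)            # assumed (state, reward, done) interface
            a_next = policy(s_next)
            delta = r + (0.0 if done else gamma * q(s_next, a_next)) - q(s, a)
            z *= gamma * lam
            z[a] += phi(s)
            w += alpha * delta * z                   # gradient-descent style update
            s, a = s_next, a_next
    return w
```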
It was shown recently that SVMs are particularly well suited to defining action policies that keep a dynamical system inside a given constraint set (in the framework of viability theory). However, the training set of the SVMs faces the curse of dimensionality, because it is based on a regular grid of the state space. In this paper, we propose an active learning approach aiming at dramatically decreasing the training set size, keeping it as close as possible to the final number of support vectors. We use a virtual multi-resolution grid, and some particularities of the problem, to choose very efficient examples to add to the training set. To illustrate the performance of the algorithm, we solve a six-dimensional problem, controlling a bike on a track, a problem usually solved using reinforcement learning techniques.
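A generic active-learning loop in the spirit of the approach above labels only the candidate grid points closest to the current SVM decision boundary, instead of labeling a full fine grid; the grid handling, labeling oracle, and scikit-learn usage below are illustrative assumptions rather than the authors' multi-resolution scheme.

```python
# Margin-based active learning of an SVM classifier over candidate grid points.
import numpy as np
from sklearn.svm import SVC

def active_svm(candidates, label, n_init=50, n_per_round=25, rounds=10, rng=None):
    """candidates: 2-D array of candidate states; label(x) -> 0/1 viability label.
    Assumes both classes appear in the initial random sample."""
    rng = rng if rng is not None else np.random.default_rng(0)
    idx = rng.choice(len(candidates), size=n_init, replace=False)
    X = candidates[idx]
    y = np.array([label(x) for x in X])
    remaining = np.delete(np.arange(len(candidates)), idx)
    svm = SVC(kernel="rbf", C=10.0).fit(X, y)
    for _ in range(rounds):
        # query the unlabeled candidates closest to the decision boundary
        margins = np.abs(svm.decision_function(candidates[remaining]))
        pick = remaining[np.argsort(margins)[:n_per_round]]
        X = np.vstack([X, candidates[pick]])
        y = np.concatenate([y, [label(x) for x in candidates[pick]]])
        remaining = np.setdiff1d(remaining, pick)
        svm = SVC(kernel="rbf", C=10.0).fit(X, y)
    return svm
```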
This paper presents an original technique for computing the optimal policy of a Markov decision problem with a continuous state space and discrete decision variables. We propose an extension of the Q-learning algorithm introduced by Watkins in 1989 for discrete Markov decision problems. Our algorithm relies on stochastic approximation and functional estimation, and uses kernels to locally update the Q-functions. We state, under mild assumptions, a convergence theorem for this algorithm. Finally, we illustrate our algorithm by solving two classical problems: the Mountain Car task and the Puddle World task.
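A rough sketch of the kernel-smoothed, locally updated Q-function idea (details assumed, not the authors' exact estimator): Q is stored at a set of reference states, and each observed transition updates nearby reference states in proportion to a kernel weight.

```python
# Kernel-smoothed Q-learning for continuous states and discrete actions.
import numpy as np

class KernelQ:
    def __init__(self, centers, n_actions, bandwidth=0.3, gamma=0.95):
        self.centers = np.asarray(centers)           # reference states
        self.q = np.zeros((len(centers), n_actions))
        self.h, self.gamma = bandwidth, gamma

    def _weights(self, s):
        d = np.linalg.norm(self.centers - np.asarray(s), axis=1)
        w = np.exp(-(d / self.h) ** 2)
        return w / w.sum()

    def value(self, s, a):
        return self._weights(s) @ self.q[:, a]

    def update(self, s, a, r, s_next, alpha):
        target = r + self.gamma * max(self.value(s_next, b)
                                      for b in range(self.q.shape[1]))
        # local update: each reference state moves toward the target in
        # proportion to its kernel weight at the observed state s
        self.q[:, a] += alpha * self._weights(s) * (target - self.q[:, a])
```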
Opposition-based learning (OBL) is a new scheme in machine intelligence. In this paper, an OBL version of Q-learning, which exploits opposite quantities to accelerate learning, is used for the management of single-reservoir operations. In this method, an agent takes an action, receives a reward, and updates its knowledge in terms of action-value functions. Furthermore, the transition function, which is the balance equation in the optimization model, determines the next state and updates the action-value function pertinent to the opposite action. Two types of opposite actions are defined. It is demonstrated that using OBL can significantly improve the efficiency of the operating policy within a limited number of iterations. It is also shown that this technique is more robust than Q-learning.
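The core mechanism can be sketched as a paired update: each observed Q-learning step is accompanied by an update for the opposite action, whose reward and next state are obtained from the known balance (transition) equation. The environment model and opposite-action mapping below are illustrative assumptions.

```python
# Opposition-based Q-learning update for a tabular Q-function.
import numpy as np

def obl_q_update(Q, s, a, r, s_next, opposite, model, alpha=0.1, gamma=0.95):
    """Q: array [n_states, n_actions]; opposite(s, a) -> opposite action;
    model(s, a) -> (reward, next_state) via the balance equation."""
    # standard Q-learning update for the taken action
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    # extra update for the opposite action, simulated through the balance equation
    a_op = opposite(s, a)
    r_op, s_op_next = model(s, a_op)
    Q[s, a_op] += alpha * (r_op + gamma * Q[s_op_next].max() - Q[s, a_op])
    return Q
```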
Cellular Simultaneous Recurrent Neural Networks (SRNs) show great promise in solving complex function approximation problems. In particular, approximate dynamic programming is an important application area where SRNs have significant potential advantages compared to other approximation methods. Learning in SRNs, however, has proved to be a notoriously difficult problem, which has prevented their broader use. This paper introduces an extended Kalman filter approach to train SRNs. Using the two-dimensional maze navigation problem as a testbed, we illustrate the operation of the method and demonstrate its benefits in generalization and testing performance.
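A compact sketch of a single extended-Kalman-filter weight-update step for a neural approximator follows (a generic EKF recursion, not the paper's exact cellular SRN setup); the network, its output Jacobian, and the noise covariances are assumptions.

```python
# One EKF step treating the network weights as the filter state.
import numpy as np

def ekf_train_step(w, P, x, target, forward, jacobian, R=1e-2, Q=1e-5):
    """
    w: flat weight vector; P: weight covariance matrix
    forward(w, x) -> scalar network output
    jacobian(w, x) -> d(output)/d(w), shape (len(w),)
    """
    y = forward(w, x)
    H = jacobian(w, x).reshape(1, -1)                  # measurement Jacobian
    S = H @ P @ H.T + R                                # innovation covariance (1x1)
    K = (P @ H.T) / S                                  # Kalman gain, shape (len(w), 1)
    w = w + (K * (target - y)).ravel()                 # weight update
    P = P - K @ H @ P + Q * np.eye(len(w))             # covariance update with process noise
    return w, P
```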