ISBN (print): 9781467359252
Effective cooperation of multi-robots in unknown environments is essential in many robotic applications, such as environment exploration and target searching. In this paper, a combined hierarchical reinforcement learning approach, together with a designed cooperation strategy, is proposed for the real-time cooperation of multi-robots in completely unknown environments. Unlike other algorithms that need an explicit environment model or select parameters by trial and error, the proposed cooperation method obtains all the required parameters automatically through learning. By integrating segmental options with the traditional MAXQ algorithm, the cooperation hierarchy is built. In new tasks, the designed cooperation method can control the multi-robot system to complete the task effectively. The simulation results demonstrate that the proposed scheme is able to effectively and efficiently lead a team of robots to cooperatively accomplish target-searching tasks in completely unknown environments.
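The MAXQ decomposition referenced above splits the value of a parent subtask into the value of the selected child plus a learned completion term. The following is a minimal sketch of that decomposition only; the task hierarchy and table layout are illustrative assumptions, and the paper's combination with segmental options is not detailed in the abstract.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class Task:
    name: str
    primitive: bool = False
    children: tuple = ()

# Learned tables: V for primitive-action values, C for completion values.
V = defaultdict(float)   # V[(task, state)]
C = defaultdict(float)   # C[(parent, state, child)]

def value(task, state):
    """MAXQ decomposition: V(i, s) = max_a [ V(a, s) + C(i, s, a) ]."""
    if task.primitive:
        return V[(task, state)]
    return max(value(a, state) + C[(task, state, a)] for a in task.children)

# Illustrative hierarchy (subtask names are not from the paper):
go_to_target   = Task("go_to_target", primitive=True)
avoid_obstacle = Task("avoid_obstacle", primitive=True)
search         = Task("search", children=(go_to_target, avoid_obstacle))
print(value(search, state=(2, 3)))   # 0.0 before any learning
```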
ISBN (print): 9781467359252
We consider the class of online planning algorithms for optimal control, which, compared to dynamic programming, are relatively unaffected by large state dimensionality. We introduce a novel planning algorithm called SOOP that works for deterministic systems with continuous states and actions. SOOP is the first method to explore the true solution space, consisting of infinite sequences of continuous actions, without requiring knowledge about the smoothness of the system. SOOP can be used parameter-free at the cost of more model calls, but we also propose a more practical variant tuned by a single parameter, which balances finer discretization against longer planning horizons. Experiments on three problems show SOOP reliably ranks among the best algorithms, fully dominating competing methods when the problem requires both long horizons and fine discretization.
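For orientation, the sketch below is the plain receding-horizon planner that such methods improve upon: it fixes a uniform action discretization and a fixed horizon, enumerates all sequences, and returns the first action of the best one. It is a uniform baseline over assumed toy dynamics, not SOOP itself, which adaptively refines both the discretization and the explored sequence lengths.

```python
import itertools

def plan_first_action(model, reward, x0, actions, horizon, gamma=0.95):
    """Enumerate all action sequences of a fixed length over a fixed
    discretization and return the first action of the best sequence."""
    best_return, best_first = float("-inf"), None
    for seq in itertools.product(actions, repeat=horizon):
        x, ret = x0, 0.0
        for k, u in enumerate(seq):
            ret += (gamma ** k) * reward(x, u)
            x = model(x, u)               # deterministic step x_{k+1} = f(x_k, u_k)
        if ret > best_return:
            best_return, best_first = ret, seq[0]
    return best_first

# Toy 1-D system (assumed for illustration): push the state toward zero.
model  = lambda x, u: x + 0.1 * u
reward = lambda x, u: -abs(x)
print(plan_first_action(model, reward, x0=1.0, actions=(-1.0, 0.0, 1.0), horizon=4))
```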
ISBN (print): 9781467359252
Though not a fundamental prerequisite to efficient machine learning, insertion of domain knowledge into an adaptive virtual agent is nonetheless known to improve learning efficiency and reduce model complexity. Conventionally, domain knowledge is inserted prior to learning. Despite being effective, such an approach may not always be feasible. Firstly, the effect of domain knowledge is assumed and can be inaccurate. Also, domain knowledge may not be available prior to learning. In addition, the insertion of domain knowledge can frame learning and hamper the discovery of more effective knowledge. Therefore, this work advances the use of domain knowledge by proposing to delay its insertion and moderate its effect, reducing the framing effect while still benefiting from the domain knowledge. Using a non-trivial pursuit-evasion problem domain, experiments are first conducted to illustrate the impact of domain knowledge with different degrees of truth. The next set of experiments illustrates how delayed insertion of such domain knowledge can impact learning. The final set of experiments illustrates how delaying the insertion and moderating the assumed effect of domain knowledge can ensure the robustness and versatility of reinforcement learning.
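One concrete way to realize "delayed and moderated" domain knowledge is potential-based reward shaping that is switched on only after a number of episodes and scaled by a moderation weight. The mechanism, the episode threshold, and the weight below are assumptions for illustration, not the paper's actual design.

```python
def shaped_reward(r_env, phi, s, s_next, episode,
                  start_episode=200, weight=0.5, gamma=0.99):
    """Environment reward plus a delayed, moderated shaping term built from
    domain knowledge.  The delay (start_episode) and the moderation weight
    mirror the two ideas in the abstract; potential-based shaping itself is
    an assumed mechanism, not necessarily the one used in the paper."""
    if episode < start_episode:
        return r_env                      # learn without domain knowledge first
    return r_env + weight * (gamma * phi(s_next) - phi(s))

# Illustrative potential for a pursuer: prefer being close to the evader.
phi = lambda s: -abs(s["pursuer"] - s["evader"])
print(shaped_reward(0.0, phi, {"pursuer": 5, "evader": 1},
                    {"pursuer": 4, "evader": 1}, episode=300))
```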
ISBN (print): 9781467359252
Traditional reinforcement learning algorithms, such as Q-learning, Q(lambda), Sarsa, and Sarsa(lambda), update the action value function using a temporal difference (TD) error, which is computed from the last action value function. From the perspective of the TD error, and to address the low efficiency and slow convergence of the traditional Sarsa(lambda) algorithm, this paper defines the nth-order TD error, applies it to the traditional Sarsa(lambda) algorithm, and develops a fast Sarsa(lambda) algorithm based on the 2nd-order TD error. The algorithm adjusts the Q value with the second-order TD error and broadcasts the TD error into the whole state-action space, which speeds up convergence. This paper also analyzes the convergence rate; under the condition of one-step updates, the results show that the number of iterations depends primarily on gamma and epsilon. Finally, applying the proposed algorithm to traditional reinforcement learning problems, the results show that the algorithm has both a faster convergence rate and better convergence performance.
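The sketch below performs one Sarsa(lambda) backup with eligibility traces, combining the ordinary first-order TD error with the previous step's error as a stand-in for the second-order TD error. The paper's exact definition of the nth-order error is not given in the abstract, so that combination (and the weight beta) is an assumption.

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s2, a2, prev_delta,
                      alpha=0.1, gamma=0.95, lam=0.9, beta=0.5):
    """One Sarsa(lambda) backup.  `delta` is the usual first-order TD error;
    adding beta * prev_delta is only a placeholder for the paper's
    second-order TD error."""
    delta = r + gamma * Q[s2, a2] - Q[s, a]   # first-order TD error
    delta2 = delta + beta * prev_delta        # assumed second-order form
    E[s, a] += 1.0                            # accumulating eligibility trace
    Q += alpha * delta2 * E                   # broadcast over all state-actions
    E *= gamma * lam
    return delta

Q, E = np.zeros((5, 2)), np.zeros((5, 2))
d = sarsa_lambda_step(Q, E, s=0, a=1, r=1.0, s2=1, a2=0, prev_delta=0.0)
```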
ISBN (print): 9781467359252
Reinforcement learning algorithms enable an agent to optimize its behavior by interacting with a specific environment. Although some very successful applications of reinforcement learning algorithms have been developed, it is still an open research question how to scale up to large dynamic environments. In this paper we study the use of reinforcement learning on the popular arcade video game Ms. Pac-Man. In order to let Ms. Pac-Man learn quickly, we designed smart feature extraction algorithms that produce higher-order inputs from the game state. These inputs are then given to a neural network that is trained using Q-learning. We constructed higher-order features that are relative to the action of Ms. Pac-Man. These relative inputs are then given to a single neural network, which propagates the action-relative inputs for each action in turn to obtain the Q-values of the different actions. The experimental results show that this approach allows the use of only 7 input units in the neural network while still quickly obtaining very good playing behavior. Furthermore, the experiments show that our approach enables Ms. Pac-Man to successfully transfer its learned policy to a different maze on which it was not trained before.
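Action-relative features mean the same small network is evaluated once per candidate move, each time with inputs recomputed relative to that move, and the move with the highest output is chosen. The sketch below illustrates only that idea; the feature names and the linear "network" are placeholders, not the seven inputs actually used in the paper.

```python
import numpy as np

ACTIONS = ("up", "down", "left", "right")

def action_relative_features(measurements, action):
    """Build a small input vector relative to a candidate action.  These
    three keys are illustrative stand-ins for the paper's seven inputs."""
    m = measurements[action]
    return np.array([m["nearest_ghost"], m["nearest_pill"], m["is_blocked"]])

def greedy_action(measurements, q_net):
    """Evaluate one shared network once per action and pick the argmax."""
    q = [q_net(action_relative_features(measurements, a)) for a in ACTIONS]
    return ACTIONS[int(np.argmax(q))]

# Toy usage with a linear stand-in for the trained network:
w = np.array([-1.0, -0.2, -5.0])
q_net = lambda x: float(w @ x)
measurements = {a: {"nearest_ghost": 5.0, "nearest_pill": 2.0, "is_blocked": 0.0}
                for a in ACTIONS}
measurements["left"]["nearest_pill"] = 1.0
print(greedy_action(measurements, q_net))   # -> "left"
```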
This paper presents a novel approach for constructing basis functions in approximate dynamic programming (ADP) through the locally linear embedding (LLE) process. It considers the experience (sample) data as a high-di...
In this paper, we present a new adaptive dynamic programming approach that integrates a reference network providing an internal goal representation to help the system's learning and optimization. Specifically, we build the reference network on top of the critic network to form a dual critic network design that contains the detailed internal goal representation to help approximate the value function. This internal goal signal, working as the reinforcement signal for the critic network in our design, is adaptively generated by the reference network and can also be adjusted automatically. In this way, we provide an alternative to crafting the reinforcement signal manually from prior knowledge. In this paper, we adopt the online action-dependent heuristic dynamic programming (ADHDP) design and provide the detailed design of the dual critic network structure. A detailed Lyapunov stability analysis of our proposed approach is presented to support the proposed structure from a theoretical point of view. Furthermore, we also develop a virtual reality platform to demonstrate the real-time simulation of our approach under different disturbance situations. The overall adaptive learning performance has been tested on two tracking control benchmarks with a tracking filter. For comparative studies, we also present the tracking performance with the typical ADHDP, and the simulation results justify the improved performance of our approach.
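Structurally, the design involves three function approximators: an action (actor) network, a reference network that produces the internal goal signal, and a critic that consumes state, action, and that goal signal. The forward pass below is a minimal sketch with arbitrary network sizes; the input/output wiring is a reading of the abstract, not the paper's exact architecture, and training updates are omitted.

```python
import numpy as np

def mlp(sizes, rng):
    """Tiny random tanh MLP used as a stand-in for each network."""
    Ws = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[1:], sizes[:-1])]
    def forward(x):
        for W in Ws[:-1]:
            x = np.tanh(W @ x)
        return Ws[-1] @ x
    return forward

rng = np.random.default_rng(0)
n_state, n_action = 4, 1
actor     = mlp((n_state, 8, n_action), rng)
reference = mlp((n_state + n_action, 8, 1), rng)        # internal goal signal s
critic    = mlp((n_state + n_action + 1, 8, 1), rng)    # value estimate J(x, u, s)

x = rng.normal(size=n_state)
u = actor(x)
s_internal = reference(np.concatenate([x, u]))   # adaptively generated reinforcement
J = critic(np.concatenate([x, u, s_internal]))   # critic consumes the goal signal
```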
Goal representation heuristic dynamic programming (GrHDP) is proposed in this paper to demonstrate online learning in the Markov decision process. In addition to the (external) reinforcement signal in the literature, we develop an adaptive internal goal/reward representation for the agent with the proposed goal network. Specifically, we keep the actor-critic design of heuristic dynamic programming (HDP) and include a goal network to represent the internal goal signal, to further help the value function approximation. We evaluate our proposed GrHDP algorithm on two 2-D maze navigation problems, and later on one 3-D maze navigation problem. Compared to the traditional HDP approach, the learning performance of the agent is improved with our proposed GrHDP approach. In addition, we also include the learning performance of two other reinforcement learning algorithms, namely Sarsa(lambda) and Q-learning, on the same benchmarks for comparison. Furthermore, in order to demonstrate the theoretical guarantee of our proposed method, we provide an analysis of the convergence of the neural network weights in our GrHDP approach.
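To make the role of the goal network concrete, the sketch below constructs the per-step training targets: the goal network is pushed toward a discounted sum of the external reward, and the critic is pushed toward a discounted sum of the internal goal signal. The target forms follow the usual HDP pattern and should be read as an interpretation of the abstract rather than the paper's exact equations; network parameter updates are omitted.

```python
def grhdp_step_targets(goal_net, critic, x, u, x_next, u_next, r_ext, gamma=0.95):
    """Training targets for one GrHDP time step (gradient updates omitted)."""
    s = goal_net(x, u)                       # internal goal/reward signal
    s_next = goal_net(x_next, u_next)
    J_next = critic(x_next, u_next, s_next)
    goal_target = r_ext + gamma * s_next     # goal network tracks external reward
    critic_target = s + gamma * J_next       # critic tracks the internal signal
    return goal_target, critic_target

# Toy usage with scalar states/actions and linear stand-ins for the networks:
goal_net = lambda x, u: -abs(x) - 0.1 * abs(u)
critic = lambda x, u, s: s - 0.5 * abs(x)
print(grhdp_step_targets(goal_net, critic, 1.0, 0.2, 0.8, 0.1, r_ext=-1.0))
```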
In this paper a new method for designing and implementing a coordinated wide-area controller architecture is presented and tested using real-time digital simulation on a benchmark two-area power system model for improved power system dynamic stability. The algorithm is an optimal Wide Area System-Centric Controller and Observer (WASCCO) based on reinforcement and temporal difference learning, which allows the system to learn from interaction and predict future states. The controller design uses a powerful technique of the adaptive critic design (ACD) family called dual heuristic programming (DHP). The DHP controller's training and testing are implemented on an Innovative Integration Picolo card featuring the TMS320C28335 processor. The main advantage of this design is its ability to learn from the past using eligibility traces and to predict the optimal trajectory through temporal difference learning in a receding horizon control (RHC) format. Results on the two-area system show a better response compared to conventional schemes.
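In DHP the critic does not estimate the cost-to-go itself but its gradient with respect to the state. The sketch below writes the standard DHP critic target under assumed simplifications (a linear model, linear state feedback, quadratic utility); it illustrates the general ACD/DHP technique, not the WASCCO controller's specific equations.

```python
import numpy as np

def dhp_critic_target(x, lam_next, A, B, K, Q, R, gamma=0.95):
    """Standard DHP critic target for x' = Ax + Bu, policy u = -Kx, and
    utility U = x'Qx + u'Ru.  The critic approximates lambda(x) = dJ/dx."""
    u = -K @ x
    dU_dx = 2.0 * Q @ x
    dU_du = 2.0 * R @ u
    A_cl = A - B @ K                       # closed-loop Jacobian dx'/dx
    return dU_dx + (-K).T @ dU_du + gamma * A_cl.T @ lam_next

# Toy two-state example (values are illustrative only):
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[1.0, 0.5]])
Qc, Rc = np.eye(2), np.eye(1)
print(dhp_critic_target(np.array([1.0, 0.0]), np.zeros(2), A, B, K, Qc, Rc))
```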