Though not a fundamental prerequisite to efficient machine learning, the insertion of domain knowledge into an adaptive virtual agent is nonetheless known to improve learning efficiency and reduce model complexity. Conventionally, domain knowledge is inserted prior to learning. Although effective, such an approach may not always be feasible. First, the effect of the domain knowledge is assumed and can be inaccurate. Also, domain knowledge may not be available prior to learning. In addition, the insertion of domain knowledge can frame learning and hamper the discovery of more effective knowledge. This work therefore advances the use of domain knowledge by proposing to delay its insertion and moderate its effect, reducing the framing effect while still benefiting from the knowledge. Using a non-trivial pursuit-evasion problem domain, experiments are first conducted to illustrate the impact of domain knowledge with different degrees of truth. The next set of experiments illustrates how delayed insertion of such domain knowledge can impact learning. The final set of experiments illustrates how delaying the insertion and moderating the assumed effect of domain knowledge can ensure the robustness and versatility of reinforcement learning.
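As an illustrative companion to this abstract, the sketch below shows one way delayed, moderated insertion of domain knowledge could be wired into tabular Q-learning: a shaping bonus derived from the assumed knowledge is switched on only after a chosen episode and is scaled by a moderation weight. This is not the paper's algorithm; the environment interface (reset/step/available_actions), the knowledge_bonus callable, and the insertion_episode and moderation parameters are all illustrative assumptions.

```python
import random

# Illustrative sketch (not the paper's algorithm): tabular Q-learning where a
# domain-knowledge shaping bonus is only switched on after a delay and is
# scaled by a moderation factor, so it guides but does not dominate learning.

def q_learning_with_delayed_knowledge(
    env,                    # assumed to expose reset()/step(a)/available_actions(s)
    knowledge_bonus,        # assumed callable: (state, action) -> float
    episodes=500,
    insertion_episode=100,  # delay before the domain knowledge is used
    moderation=0.3,         # scales the assumed effect of the knowledge
    alpha=0.1, gamma=0.95, epsilon=0.1,
):
    Q = {}

    def q(s, a):
        return Q.get((s, a), 0.0)

    for ep in range(episodes):
        s, done = env.reset(), False
        while not done:
            actions = env.available_actions(s)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: q(s, a_))
            s2, r, done = env.step(a)
            # Delayed, moderated insertion of domain knowledge.
            if ep >= insertion_episode:
                r += moderation * knowledge_bonus(s, a)
            future = 0.0 if done else max(q(s2, a_) for a_ in env.available_actions(s2))
            target = r + gamma * future
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            s = s2
    return Q
```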
Reinforcement learning algorithms enable an agent to optimize its behavior by interacting with a specific environment. Although some very successful applications of reinforcement learning algorithms have been developed, how to scale up to large dynamic environments remains an open research question. In this paper we study the use of reinforcement learning on the popular arcade video game Ms. Pac-Man. To let Ms. Pac-Man learn quickly, we designed particular smart feature-extraction algorithms that produce higher-order inputs from the game state. These inputs are then given to a neural network that is trained using Q-learning. We constructed higher-order features that are relative to the action of Ms. Pac-Man. These relative inputs are given to a single neural network, which sequentially propagates the action-relative inputs to obtain the Q-values of the different actions. The experimental results show that this approach allows the use of only 7 input units in the neural network while still quickly obtaining very good playing behavior. Furthermore, the experiments show that our approach enables Ms. Pac-Man to successfully transfer its learned policy to a different maze on which it was not trained.
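A minimal sketch of the action-relative architecture the abstract describes, assuming a small feed-forward network: the same network is evaluated once per candidate move on a 7-dimensional, action-relative feature vector and returns a single Q-value, so the agent acts greedily over those per-action outputs. The feature contents, network sizes, and the training loop are placeholders, not the authors' implementation.

```python
import numpy as np

# Minimal sketch (not the authors' exact network): one small MLP is reused for
# every move; for each candidate action we build action-relative features and
# read out a single scalar Q-value, then act greedily over those values.

rng = np.random.default_rng(0)
N_FEATURES = 7          # the paper reports 7 input units
HIDDEN = 20

W1 = rng.normal(scale=0.1, size=(HIDDEN, N_FEATURES))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=HIDDEN)
b2 = 0.0

def q_value(features):
    """Scalar Q-value for one action's relative feature vector."""
    h = np.tanh(W1 @ features + b1)
    return W2 @ h + b2

def relative_features(state, action):
    """Placeholder: in the paper these encode game information (e.g. pills,
    ghosts) relative to the direction of the candidate move."""
    return rng.normal(size=N_FEATURES)

def select_action(state, actions):
    # The same single network is applied sequentially, once per action.
    q_values = [q_value(relative_features(state, a)) for a in actions]
    return actions[int(np.argmax(q_values))]
```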
Swarm robotic systems are a type of multi-robot system that generally consists of many homogeneous autonomous robots without any global controller. Swarm robotics aims at designing desired collective behaviors that emerge through many interactions among the robots and with their environment. Because a robotic swarm is controlled in an emergent way, e.g., as a result of self-organization through robot learning or artificial evolution, to the best of our knowledge no practical method has been known for grasping its macroscopic collective behavior. In this paper, we propose a novel method for analyzing collective behavior by introducing the concept of the behavioral sequence, which stems from ethology. Analysis of behavioral sequences reveals the transitions of a robot's actions from the viewpoint of specialization and helps us understand the role of subgroups in a robotic swarm. Applying this method, we observe collective behavior in a foraging task of autonomous mobile robots.
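A small illustrative sketch (not the paper's method) of the behavioral-sequence idea: given each robot's logged sequence of discrete behaviors, behavior-to-behavior transitions are counted, so specialization shows up as robots dwelling within a subset of behaviors. The behavior labels and the foraging log below are hypothetical.

```python
from collections import Counter

# Count behavior-to-behavior transitions in each robot's logged sequence.
def transition_counts(behavior_sequences):
    counts = Counter()
    for seq in behavior_sequences:
        counts.update(zip(seq, seq[1:]))
    return counts

# Hypothetical foraging logs: 'S' search, 'G' grab, 'H' homing, 'R' rest.
logs = [list("SSGHSSGH"), list("SSSSRRSS"), list("SGHSGHSG")]
for (a, b), n in sorted(transition_counts(logs).items()):
    print(f"{a} -> {b}: {n}")
```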
This brief presents a novel framework of robust adaptive dynamic programming (robust-ADP) aimed at computing globally stabilizing and suboptimal control policies in the presence of dynamic uncertainties. A key strategy is to integrate ADP theory with techniques from modern nonlinear control, with the objective of filling a gap in the past ADP literature, which did not take dynamic uncertainties into account. Neither the system dynamics nor the system order is required to be precisely known. As an illustrative example, the computational algorithm is applied to the controller design of a two-machine power system.
In this paper, we aim to solve an infinite-time optimal tracking control problem for a class of discrete-time nonlinear systems using an iterative adaptive dynamic programming (ADP) algorithm. When the iterative tracking...
ISBN (Print): 9781479903573
We apply diffusion strategies to propose a cooperative reinforcement learning algorithm, in which agents in a network communicate with their neighbors to improve predictions about their environment. The algorithm is suitable for learning off-policy even in large state spaces. We provide a mean-square-error performance analysis under constant step-sizes. The gain from cooperation, in the form of greater stability and lower bias and variance in the prediction error, is illustrated in the context of a classical model. We show that the improvement in performance is especially significant when the behavior policy of the agents differs from the target policy under evaluation.
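The adapt-then-combine structure of diffusion strategies can be sketched as follows for cooperative policy evaluation with linear value functions: each agent takes a local TD(0) step on its own transition and then averages its parameters with its neighbors via a row-stochastic combination matrix. The ring topology, combination weights, and placeholder transition sampler are assumptions; the paper's actual off-policy algorithm and its mean-square-error analysis are not reproduced here.

```python
import numpy as np

# Adapt-then-combine (ATC) diffusion sketch for cooperative value prediction
# with linear function approximation. Every agent performs a local TD(0)
# update, then mixes parameters with its neighbors through matrix C.

rng = np.random.default_rng(1)
N_AGENTS, DIM = 5, 4
alpha, gamma = 0.05, 0.9

# Combination weights: uniform over a ring neighborhood (assumption).
C = np.zeros((N_AGENTS, N_AGENTS))
for k in range(N_AGENTS):
    for j in (k - 1, k, k + 1):
        C[k, j % N_AGENTS] = 1.0 / 3.0

theta = np.zeros((N_AGENTS, DIM))

def sample_transition():
    """Placeholder environment: random features and reward."""
    phi = rng.normal(size=DIM)
    phi_next = rng.normal(size=DIM)
    reward = rng.normal()
    return phi, reward, phi_next

for _ in range(1000):
    psi = np.empty_like(theta)
    for k in range(N_AGENTS):                      # adaptation step
        phi, r, phi_next = sample_transition()
        delta = r + gamma * theta[k] @ phi_next - theta[k] @ phi
        psi[k] = theta[k] + alpha * delta * phi
    theta = C @ psi                                # combination step
```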
To synthesize fixed-final-time, control-constrained optimal controllers for discrete-time nonlinear control-affine systems, a single neural network (NN)-based controller called the Finite-horizon Single Network Adaptive Critic is developed in this paper. The inputs to the NN are the current system states and the time-to-go, and the network outputs are the costates that are used to compute the optimal feedback control. Control constraints are handled through a nonquadratic cost function. Convergence proofs are provided for: 1) the reinforcement-learning-based training method to the optimal solution; 2) the training error; and 3) the network weights. The resulting controller is shown to solve the associated time-varying Hamilton-Jacobi-Bellman equation and to provide the fixed-final-time optimal solution. The performance of the new synthesis technique is demonstrated through different examples, including an attitude control problem in which a rigid spacecraft performs a finite-time attitude maneuver subject to control bounds. The new formulation has great potential for implementation, since it consists of only one NN with a single set of weights and it provides comprehensive feedback solutions online, though it is trained offline.
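The interface of such a critic can be sketched as follows, assuming a control-affine system x_{k+1} = f(x_k) + g(x_k) u_k with a quadratic control cost: a single network maps the current state and the time-to-go to a costate estimate, from which feedback control is recovered. The paper's offline training procedure and the nonquadratic cost used for constraint handling are omitted; the network shape and the g(x) below are illustrative.

```python
import numpy as np

# Interface-only sketch of a single-network adaptive critic: the critic takes
# (state, time-to-go) and outputs a costate estimate; feedback control is then
# u = -R^{-1} g(x)^T lambda for a quadratic control cost (simplification).

rng = np.random.default_rng(2)
STATE_DIM, CTRL_DIM, HIDDEN = 2, 1, 16

W1 = rng.normal(scale=0.1, size=(HIDDEN, STATE_DIM + 1))  # +1 for time-to-go
W2 = rng.normal(scale=0.1, size=(STATE_DIM, HIDDEN))
R_inv = np.eye(CTRL_DIM)

def costate(x, time_to_go):
    z = np.concatenate([x, [time_to_go]])
    return W2 @ np.tanh(W1 @ z)

def g(x):
    """Placeholder input matrix of the control-affine dynamics."""
    return np.array([[0.0], [1.0]])

def control(x, time_to_go):
    # The critic's output is used as the costate estimate for the next step.
    lam = costate(x, time_to_go)
    return -R_inv @ g(x).T @ lam

print(control(np.array([1.0, -0.5]), time_to_go=10.0))
```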
Model-free reinforcement learning (RL) has become a promising technique for designing a robust dynamic power management (DPM) framework that can cope with variations and uncertainties that emanate from hardware and application characteristics. Moreover, the potentially significant benefit of performing application-level scheduling as part of system-level power management should be harnessed. This paper presents an architecture for hierarchical DPM in an embedded system composed of a processor chip and connected I/O devices (which are called system components). The goal is to facilitate savings in the system components' power consumption, which tends to dominate the total power consumption. The proposed (online) adaptive DPM technique consists of two layers: an RL-based component-level local power manager (LPM) and a system-level global power manager (GPM). The LPM performs component power and latency optimization. It employs temporal-difference learning on a semi-Markov decision process (SMDP) for model-free RL, and it is specifically optimized for an environment in which multiple (heterogeneous) types of applications can run in the embedded system. The GPM interacts with the CPU scheduler to perform effective application-level scheduling, thereby enabling the LPM to carry out even more component power optimizations. In this hierarchical DPM framework, the power and latency tradeoff of each type of application can be precisely controlled based on a user-defined parameter. Experiments show average power savings of up to 31.1% compared to existing approaches.
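A schematic sketch of the kind of SMDP-based temporal-difference update a component-level power manager could use, assuming decisions are made at request boundaries with variable epoch durations and a user parameter w trades off energy against latency. The state/action sets, discounting scheme, and parameter names are illustrative, not the paper's LPM.

```python
import random

# Schematic SMDP Q-learning for power-state decisions: epochs have variable
# duration tau, so the discount depends on the duration, and the cost mixes
# energy and latency through a user-defined tradeoff weight w (assumption).

ACTIONS = ["active", "idle", "sleep"]
alpha, gamma, w = 0.1, 0.98, 0.5     # w: power vs. latency tradeoff
Q = {}

def q(s, a):
    return Q.get((s, a), 0.0)

def update(s, a, energy, latency, tau, s_next):
    # SMDP temporal-difference update with duration-dependent discounting.
    cost = w * energy + (1.0 - w) * latency
    discount = gamma ** tau
    target = -cost + discount * max(q(s_next, a2) for a2 in ACTIONS)
    Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))

def choose(s, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q(s, a))
```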
ISBN (Print): 9781467314909
This paper gives specific divergence examples of value iteration for several major reinforcement learning and adaptive dynamic programming algorithms when using a function approximator for the value function. These divergence examples differ from previous divergence examples in the literature in that they apply to a greedy policy, i.e., in a "value iteration" scenario. Perhaps surprisingly, with a greedy policy it is also possible to obtain divergence for the algorithms TD(1) and Sarsa(1). In addition to these divergences, we also obtain divergence for the adaptive dynamic programming algorithms HDP, DHP, and GDHP.
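For context, the setting the paper analyzes can be sketched as value iteration combined with a linear function approximator and a greedy (max) backup; on suitably chosen MDPs and feature matrices this iteration can make the weights grow without bound rather than converge. The random MDP below is only a placeholder and does not reproduce the paper's counterexamples.

```python
import numpy as np

# Fitted value iteration with linear function approximation: apply the greedy
# Bellman backup at every state, then project the backed-up values onto the
# span of the features by least squares. For particular MDP/feature choices
# (such as those constructed in the paper) this loop diverges.

rng = np.random.default_rng(3)
N_STATES, N_ACTIONS, DIM = 6, 2, 3
gamma = 0.9

P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # P[s, a, s']
R = rng.normal(size=(N_STATES, N_ACTIONS))                        # rewards
Phi = rng.normal(size=(N_STATES, DIM))                            # features
theta = np.zeros(DIM)

for _ in range(200):
    V = Phi @ theta
    targets = (R + gamma * P @ V).max(axis=1)          # greedy backup
    theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)  # projection step

print("final weights:", theta)
```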