There has been growing interest in the study of adaptive/approximate dynamic programming (ADP) in recent years. The ADP technique provides a powerful tool for understanding and improving the principled technologies of machine intelligence systems. As one of the ADP algorithms based on adaptive critic neural networks (NNs), direct heuristic dynamic programming (direct HDP) has been successfully applied to realistic engineering control problems. In this study, based on a three-network architecture in which the reinforcement signal is approximated by an additional NN, a novel integrated design method for intensified direct HDP is developed. The new design approach is implemented using multiple PID neural networks (PIDNNs), which effectively take into account structural knowledge of the system states and controls that is usually present in a physical system. Using a Lyapunov stability approach, a uniform ultimate boundedness (UUB) result is proved for the PIDNNs-based intensified direct HDP learning controller. Furthermore, the learning and control performance of the proposed design is tested on the popular cart-pole example to illustrate the key ideas of this paper.
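To make the PIDNN idea concrete, here is a minimal sketch of a PID neural network acting as the action unit of such a controller: the hidden layer holds one proportional, one integral, and one derivative neuron operating on the tracking error, and only the output weights are adapted. The layout, activation, and update rule are illustrative assumptions rather than the paper's exact design; dJ_du stands for the sensitivity of the critic's cost-to-go estimate with respect to the control, which a direct HDP critic would supply.

```python
import numpy as np

class PIDNNController:
    """Illustrative PIDNN action unit: P/I/D hidden neurons on the error,
    trainable output weights producing a bounded control action."""

    def __init__(self, dt=0.02, w=(1.0, 0.1, 0.05)):
        self.dt = dt
        self.integral = 0.0          # state of the integral neuron
        self.prev_e = 0.0            # previous error for the derivative neuron
        self.w = np.asarray(w, dtype=float)
        self.last_h = np.zeros(3)
        self.last_u = 0.0

    def control(self, e):
        p = e                                    # proportional neuron
        self.integral += e * self.dt             # integral neuron (accumulator)
        d = (e - self.prev_e) / self.dt          # derivative neuron (difference)
        self.prev_e = e
        self.last_h = np.array([p, self.integral, d])
        self.last_u = float(np.tanh(self.w @ self.last_h))
        return self.last_u

    def update(self, dJ_du, lr=0.01):
        # Gradient step lowering the critic's cost-to-go estimate J:
        # dJ/dw = dJ/du * du/dw, with u = tanh(w . h).
        self.w -= lr * dJ_du * (1.0 - self.last_u ** 2) * self.last_h
```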
This paper proposes an on-line near-optimal control scheme that exploits the function-approximation capabilities of neural networks (NNs) to obtain the on-line solution of the optimal control problem for nonlinear discrete-time systems. First, to solve the Hamilton-Jacobi-Bellman (HJB) equation arising in the optimal control problem forward in time, two neural networks are used to approximate the cost function and to compute the optimal control policy, respectively. Then, based on Bellman's optimality principle and adaptive techniques, on-line weight updating laws for the critic network and the action network are derived. Further, taking the NN approximation errors into account, stability of the closed-loop system is established via Lyapunov theory. Finally, a numerical example is provided to demonstrate the effectiveness of the proposed method.
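As a rough illustration of how such on-line critic/action weight updates operate, the sketch below uses linear-in-features approximators for the cost function and the control policy, a scalar control input, and a temporal-difference form of the Bellman target. The feature maps, step sizes, and finite-difference gradient are assumptions for readability, not the paper's derived update laws.

```python
import numpy as np

gamma, lr_c, lr_a = 0.95, 0.05, 0.01      # illustrative constants

def phi(x, u):                            # critic features over state and control
    z = np.append(x, u)
    return np.concatenate([z, z ** 2])

def psi(x):                               # action-network features over the state
    return np.concatenate([x, x ** 2])

def critic(Wc, x, u):                     # cost-to-go estimate J(x, u)
    return float(Wc @ phi(x, u))

def actor(Wa, x):                         # control policy u(x)
    return float(np.tanh(Wa @ psi(x)))

def update(Wc, Wa, x, u, r, x_next):
    # Critic: drive J(x, u) toward the Bellman target r + gamma * J(x', u'(x')).
    td = critic(Wc, x, u) - (r + gamma * critic(Wc, x_next, actor(Wa, x_next)))
    Wc = Wc - lr_c * td * phi(x, u)
    # Action network: descend the (updated) critic's cost-to-go estimate.
    eps = 1e-4
    dJ_du = (critic(Wc, x, u + eps) - critic(Wc, x, u - eps)) / (2 * eps)
    u_hat = actor(Wa, x)
    Wa = Wa - lr_a * dJ_du * (1.0 - u_hat ** 2) * psi(x)
    return Wc, Wa
```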
In this paper, we propose a new nonlinear tracking controller based on heuristic dynamic programming (HDP) with a tracking filter. Specifically, we integrate a goal network into the regular HDP design and provide the critic network with a detailed internal reward signal to aid value function approximation. The architecture, consisting of the tracking filter, goal network, critic network, and action network, is explained in detail. We provide a stability analysis of the proposed controller using a Lyapunov approach. It is shown that the filtered tracking errors and the weight estimation errors of the neural networks are all uniformly ultimately bounded (UUB) under certain conditions. Finally, we compare the proposed approach with the regular HDP approach in a virtual reality (VR)/Simulink environment to demonstrate the improved control performance.
In multi-objective problems, it is key to find compromise solutions that balance the different objectives. The linear scalarization function is often used to translate the multi-objective nature of a problem into a standard, single-objective problem. However, such a linear combination can only find solutions in convex regions of the Pareto front, making the method inapplicable when the shape of the front is not known beforehand, as is often the case. We propose a non-linear scalarization function, the Chebyshev scalarization function, as a basis for action selection strategies in multi-objective reinforcement learning. The Chebyshev scalarization method overcomes the flaws of the linear scalarization function in that it (i) discovers Pareto optimal solutions regardless of the shape of the front, i.e. convex as well as non-convex, (ii) obtains a better spread among the set of Pareto optimal solutions, and (iii) is not particularly dependent on the actual weights used.
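A minimal sketch of Chebyshev scalarization for action selection, assuming the agent keeps one Q-value per objective for each action and a utopian reference point z*: the greedy action is the one that minimizes the weighted Chebyshev distance to z*. The weights and the reference-point estimate below are illustrative, not the paper's tuned values.

```python
import numpy as np

def chebyshev_scalarize(q_vectors, weights, utopia):
    """Weighted Chebyshev distance of each action's Q-vector to the utopian
    point z*; q_vectors has shape (n_actions, n_objectives)."""
    return np.max(weights * np.abs(q_vectors - utopia), axis=1)

# Example greedy selection (an epsilon-greedy rule would sometimes explore).
q = np.array([[0.8, 0.1],     # action 0: strong on objective 1 only
              [0.4, 0.5],     # action 1: balanced
              [0.1, 0.9]])    # action 2: strong on objective 2 only
w = np.array([0.5, 0.5])
z_star = q.max(axis=0) + 0.1  # simple utopian-point estimate
greedy_action = int(np.argmin(chebyshev_scalarize(q, w, z_star)))
```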
We consider the class of online planning algorithms for optimal control, which, compared to dynamic programming, are relatively unaffected by large state dimensionality. We introduce a novel planning algorithm called SOOP that works for deterministic systems with continuous states and actions. SOOP is the first method to explore the true solution space, consisting of infinite sequences of continuous actions, without requiring knowledge about the smoothness of the system. SOOP can be used parameter-free at the cost of more model calls, but we also propose a more practical variant tuned by a parameter α, which balances finer discretization against longer planning horizons. Experiments on three problems show that SOOP reliably ranks among the best algorithms, fully dominating competing methods when the problem requires both long horizons and fine discretization.
Effective cooperation of multiple robots in unknown environments is essential in many robotic applications, such as environment exploration and target searching. In this paper, a combined hierarchical reinforcement learning approach, together with a designed cooperation strategy, is proposed for the real-time cooperation of multiple robots in completely unknown environments. Unlike other algorithms that need an explicit environment model or that select parameters by trial and error, the proposed cooperation method obtains all the required parameters automatically through learning. By integrating segmental options with the traditional MAXQ algorithm, the cooperation hierarchy is built. In new tasks, the designed cooperation method can control the multi-robot system to complete the task effectively. Simulation results demonstrate that the proposed scheme is able to effectively and efficiently lead a team of robots to cooperatively accomplish target searching tasks in completely unknown environments.
This paper compares three strategies for using reinforcement learning algorithms to let an artificial agent learn to play the game of Othello. The three strategies compared are: learning by self-play, learning from playing against a fixed opponent, and learning from playing against a fixed opponent while also learning from the opponent's moves. These strategies are considered for the algorithms Q-learning, Sarsa, and TD-learning. The three reinforcement learning algorithms are combined with multi-layer perceptrons and trained and tested against three fixed opponents. It is found that the best learning strategy differs per algorithm. Q-learning and Sarsa perform best when trained against the fixed opponent they are also tested against, whereas TD-learning performs best when trained through self-play. Surprisingly, Q-learning and Sarsa outperform TD-learning against the stronger fixed opponents when all methods use their best strategy. Learning from the opponent's moves as well leads to worse results than learning only from the learning agent's own moves.
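For reference, the sketch below contrasts the bootstrap targets that distinguish the three algorithms when each is combined with a multi-layer perceptron; the board encoding, the network itself, and the reward scheme are assumptions and are not taken from the paper.

```python
import numpy as np

gamma = 0.99  # illustrative discount factor

def q_learning_target(r, q_next_all):
    # Off-policy: bootstrap on the best next action's Q-value.
    return r + gamma * np.max(q_next_all)

def sarsa_target(r, q_next_all, a_next):
    # On-policy: bootstrap on the Q-value of the action actually taken next.
    return r + gamma * q_next_all[a_next]

def td_target(r, v_next):
    # TD-learning: bootstrap on the state value of the resulting position.
    return r + gamma * v_next

# Whichever target y is used, the MLP is trained by gradient descent on
# 0.5 * (y - prediction)**2 for the visited state (or state-action pair).
```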
Though not a fundamental prerequisite for efficient machine learning, the insertion of domain knowledge into an adaptive virtual agent is nonetheless known to improve learning efficiency and reduce model complexity. Conventionally, domain knowledge is inserted prior to learning. Despite being effective, such an approach may not always be feasible. Firstly, the effect of domain knowledge is assumed and can be inaccurate. Also, domain knowledge may not be available prior to learning. In addition, the insertion of domain knowledge can frame learning and hamper the discovery of more effective knowledge. Therefore, this work advances the use of domain knowledge by proposing to delay its insertion and to moderate its effect, reducing the framing effect while still benefiting from the use of domain knowledge. Using a non-trivial pursuit-evasion problem domain, experiments are first conducted to illustrate the impact of domain knowledge with different degrees of truth. The next set of experiments illustrates how delayed insertion of such domain knowledge can impact learning. The final set of experiments illustrates how delaying the insertion and moderating the assumed effect of domain knowledge can ensure the robustness and versatility of reinforcement learning.
The electricity market presents a complex economic environment and has consequently increased the need for more advanced learning methods. In the agent-based modeling and simulation framework of this economic system, the generation company's decision-making is modeled using reinforcement learning. Existing learning methods that model the generation company's strategic bidding behavior are not adapted to the non-stationary and non-Markovian environment involving multidimensional and continuous state and action spaces. This paper proposes a reinforcement learning method to overcome these limitations. The proposed method discovers the input space structure through a self-organizing map, exploits learned experience through Roth-Erev reinforcement learning, and explores through an actor-critic map. Simulation results show that the proposed method outperforms Simulated Annealing Q-learning and Variant Roth-Erev reinforcement learning. The proposed method is a step towards more realistic agent learning in Agent-based Computational Economics.
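As an illustration of the Roth-Erev component, here is a minimal sketch of a Variant Roth-Erev propensity update with proportional action selection. The recency and experimentation parameters are illustrative, and the SOM-based input-space discovery and actor-critic-map exploration of the proposed method are not shown.

```python
import numpy as np

def roth_erev_update(q, chosen, reward, recency=0.1, experiment=0.2):
    """One Variant Roth-Erev propensity update.

    q          : (n_actions,) current propensities
    chosen     : index of the action that was played
    reward     : payoff received for that action
    recency    : forgetting parameter (decays old propensities)
    experiment : experimentation parameter spreading credit to other actions
    """
    n = len(q)
    spread = q * experiment / (n - 1)                 # credit to non-chosen actions
    e = np.where(np.arange(n) == chosen, reward * (1.0 - experiment), spread)
    return (1.0 - recency) * q + e

def choice_probabilities(q):
    # Proportional action selection over the (positive) propensities.
    return q / q.sum()
```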
Reinforcement learning algorithms enable an agent to optimize its behavior by interacting with a specific environment. Although some very successful applications of reinforcement learning algorithms have been developed, it is still an open research question how to scale up to large dynamic environments. In this paper we study the use of reinforcement learning on the popular arcade video game Ms. Pac-Man. In order to let Ms. Pac-Man learn quickly, we designed smart feature extraction algorithms that produce higher-order inputs from the game state. These inputs are then given to a neural network that is trained using Q-learning. We constructed higher-order features that are relative to the action of Ms. Pac-Man. These relative inputs are given to a single neural network, which sequentially propagates the action-relative inputs to obtain the Q-values of the different actions. The experimental results show that this approach requires only 7 input units in the neural network while still quickly obtaining very good playing behavior. Furthermore, the experiments show that our approach enables Ms. Pac-Man to successfully transfer its learned policy to a different maze on which it was not trained.
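A minimal sketch of the action-relative evaluation described above: a single shared network maps a 7-dimensional action-relative feature vector to one Q-value, and that same network is propagated once per candidate move. The game-specific feature extractor (extract_features) is a hypothetical placeholder, and the network size and initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared single-hidden-layer network: 7 action-relative features -> 1 Q-value.
n_in, n_hidden = 7, 20
W1, b1 = rng.normal(0, 0.1, (n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(0, 0.1, n_hidden), 0.0

def q_value(features):
    h = np.tanh(W1 @ features + b1)
    return float(W2 @ h + b2)

def select_action(state, actions, extract_features):
    # The SAME network scores every candidate move by propagating that move's
    # relative feature vector; the greedy action is the highest-scoring one.
    q_values = [q_value(extract_features(state, a)) for a in actions]
    return int(np.argmax(q_values)), q_values
```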