ISBN: (Print) 9781479945528
General anesthesia is required for patients undergoing surgery as well as for some patients in intensive care units with acute respiratory distress syndrome. However, most anesthetics affect cardiac and respiratory functions. Hence, it is important to monitor and control the infusion of anesthetics to meet sedation requirements while keeping patient vital parameters within safe limits. The critical task of anesthesia administration also necessitates that drug dosing be optimal, patient specific, and robust. In this paper, the concept of reinforcement learning (RL) is used to develop a closed-loop anesthesia controller using the bispectral index (BIS) as a control variable while concurrently accounting for mean arterial pressure (MAP). In particular, the proposed framework uses these two parameters to control propofol infusion rates so as to regulate the BIS and MAP within a desired range. Specifically, a weighted combination of the BIS and MAP error signals is used in the proposed RL algorithm. This reduces the computational complexity of the RL algorithm and, consequently, the controller processing time.
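The abstract does not give the controller's internals, so the following is only a minimal sketch of the idea of regulating a single weighted BIS/MAP error with RL: a tabular Q-learning loop over discretized error bins and a handful of candidate propofol infusion rates. The toy patient response, the targets, the weight W_BIS, and the discretization are illustrative assumptions, not the paper's pharmacokinetic model or algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical targets and weight: the paper combines the BIS and MAP errors
# into a single weighted signal; these numbers are illustrative only.
BIS_TARGET, MAP_TARGET, W_BIS = 50.0, 80.0, 0.7

RATES = np.linspace(0.0, 10.0, 5)      # candidate propofol infusion rates
N_BINS = 21                            # discretization of the weighted error
Q = np.zeros((N_BINS, len(RATES)))     # tabular Q-function

def toy_patient(bis, mean_ap, rate):
    """Crude first-order surrogate of the drug effect, NOT a pharmacokinetic model."""
    bis += -0.8 * rate + 0.4 * (BIS_TARGET + 10 - bis) + rng.normal(0, 0.5)
    mean_ap += -0.3 * rate + 0.2 * (MAP_TARGET + 5 - mean_ap) + rng.normal(0, 0.5)
    return bis, mean_ap

def weighted_error(bis, mean_ap):
    return W_BIS * (bis - BIS_TARGET) + (1 - W_BIS) * (mean_ap - MAP_TARGET)

def to_bin(err):
    return int(np.clip((err + 30) / 60 * (N_BINS - 1), 0, N_BINS - 1))

alpha, gamma, eps = 0.1, 0.95, 0.1
for episode in range(200):
    bis, mean_ap = 95.0, 95.0                   # awake patient at induction
    s = to_bin(weighted_error(bis, mean_ap))
    for step in range(200):
        a = rng.integers(len(RATES)) if rng.random() < eps else int(Q[s].argmax())
        bis, mean_ap = toy_patient(bis, mean_ap, RATES[a])
        err = weighted_error(bis, mean_ap)
        s2, r = to_bin(err), -abs(err)          # reward penalizes the combined error
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print("greedy infusion rate per error bin:", RATES[Q.argmax(axis=1)])
```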
ISBN: (Print) 9781479945528
This paper investigates learning approaches for discovering fault-tolerant control policies to overcome thruster failures in Autonomous Underwater Vehicles (AUVs). The proposed approach is a model-based direct policy search that learns on an on-board simulated model of the vehicle. When a fault is detected and isolated, the model of the AUV is reconfigured according to the new condition. To discover a set of optimal solutions, a multi-objective reinforcement learning approach is employed, which can deal with multiple conflicting objectives. Each optimal solution can be used to generate a trajectory that navigates the AUV towards a specified target while satisfying multiple objectives. The discovered policies are executed on the robot in closed loop using the AUV's state feedback. Unlike most existing methods, which disregard the faulty thruster, our approach can also deal with partially broken thrusters to increase the persistent autonomy of the AUV. In addition, the proposed approach is applicable whether the AUV becomes under-actuated or remains redundant in the presence of a fault. We validate the proposed approach on the model of the Girona500 AUV.
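As a rough illustration of model-based direct policy search on a reconfigured vehicle model, the sketch below perturbs the parameters of a linear feedback policy on a simplified planar AUV whose thruster-allocation matrix has been rescaled to reflect a partially broken thruster, and keeps the non-dominated (distance-to-target, energy) solutions. The dynamics, allocation matrix, fault model, and search procedure are illustrative assumptions rather than the paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simplified planar AUV: 3 thrusters mapped to (surge, sway) forces by an
# allocation matrix.  All dynamics and numbers are illustrative assumptions.
T_HEALTHY = np.array([[1.0, 0.0, 0.7],
                      [0.0, 1.0, 0.7]])
fault_scale = np.array([1.0, 0.3, 1.0])       # thruster 2 is partially broken
T_FAULTY = T_HEALTHY * fault_scale            # model reconfigured after fault detection/isolation

TARGET = np.array([5.0, 3.0])

def rollout(theta, alloc, steps=60, dt=0.5):
    """Run a linear state-feedback policy u = K @ (target - pos) on the model."""
    K = theta.reshape(3, 2)
    pos, vel, energy = np.zeros(2), np.zeros(2), 0.0
    for _ in range(steps):
        u = np.clip(K @ (TARGET - pos), -1.0, 1.0)    # thruster commands
        force = alloc @ u
        vel = 0.9 * vel + dt * force                  # crude damped dynamics
        pos = pos + dt * vel
        energy += float(u @ u)
    return np.array([np.linalg.norm(TARGET - pos), energy])   # two objectives

def dominates(a, b):
    return np.all(a <= b) and np.any(a < b)

# Direct policy search on the on-board (simulated) faulty model:
# random perturbation search, keeping the non-dominated parameter vectors.
archive = []
theta = rng.normal(0, 0.1, 6)
for _ in range(300):
    cand = theta + rng.normal(0, 0.05, 6)
    f_cand = rollout(cand, T_FAULTY)
    if not any(dominates(f, f_cand) for _, f in archive):
        archive = [(p, f) for p, f in archive if not dominates(f_cand, f)]
        archive.append((cand, f_cand))
        theta = cand                                  # hill-climb along the front
print("non-dominated (distance, energy) pairs:", [np.round(f, 2) for _, f in archive])
```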
ISBN: (Print) 9781479945528
This paper investigates the use of policy gradient techniques to approximate the Pareto frontier in Multi-Objective Markov Decision Processes (MOMDPs). Despite the popularity of policy-gradient algorithms, and although gradient-ascent algorithms have already been proposed to numerically solve multi-objective optimization problems, especially in combination with multi-objective evolutionary algorithms, so far little attention has been paid to the use of gradient information for multi-objective sequential decision problems. Three different multi-objective reinforcement learning (MORL) approaches are presented here. The first two, called radial and Pareto following, start from an initial policy and perform gradient-based policy-search procedures aimed at finding a set of non-dominated policies. In contrast, the third approach performs a single gradient-ascent run that, at each step, generates an improved continuous approximation of the Pareto frontier. The parameters of a function that defines a manifold in the policy parameter space are updated following the gradient of some performance criterion so that the sequence of candidate solutions gets as close as possible to the Pareto front. Besides reviewing the three approaches and discussing their main properties, we empirically compare them with other MORL algorithms on two interesting MOMDPs.
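A minimal sketch of the radial-style idea (one gradient-ascent run per direction in the weight simplex) on a toy one-step MOMDP with a Gaussian policy is given below; the problem, the REINFORCE estimator, and the step sizes are illustrative assumptions and do not reproduce the paper's algorithms.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy one-step MOMDP: a Gaussian policy pi(a) = N(mu, sigma^2) receives two
# conflicting rewards.  The problem, rewards, and step sizes are illustrative.
SIGMA = 0.3

def rewards(a):
    return np.array([-(a - 1.0) ** 2, -(a + 1.0) ** 2])    # objective 1 vs objective 2

def policy_gradient_step(mu, weights, lr=0.05, n=512):
    """One REINFORCE ascent step on the weights-scalarized return."""
    a = rng.normal(mu, SIGMA, size=n)
    r = np.array([rewards(x) for x in a])                   # (n, 2) reward vectors
    scalar = r @ weights
    grad_logpi = (a - mu) / SIGMA ** 2                       # d/dmu log N(a; mu, sigma)
    grad = np.mean((scalar - scalar.mean()) * grad_logpi)    # baseline-subtracted estimate
    return mu + lr * grad

# Radial scheme (simplified): follow several fixed directions in the weight
# simplex, one gradient-ascent run per direction, to obtain a set of
# non-dominated policies approximating the Pareto front.
front = []
for w1 in np.linspace(0.0, 1.0, 5):
    weights, mu = np.array([w1, 1.0 - w1]), 0.0
    for _ in range(200):
        mu = policy_gradient_step(mu, weights)
    front.append((round(w1, 2), round(mu, 2)))
print("weight on objective 1 -> learned policy mean:", front)
```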
ISBN: (Print) 9781479945528
In the stochastic multi-objective multi-armed bandit (MOMAB), arms generate a vector of stochastic rewards, one per objective, instead of a single scalar reward. As a result, there is not a single optimal arm but a set of optimal arms (the Pareto front) defined by the Pareto dominance relation on reward vectors, and there is a trade-off between finding the optimal arm set (exploration) and selecting the optimal arms fairly or evenly (exploitation). To trade off exploration and exploitation, either the Pareto knowledge gradient (Pareto-KG for short) or the Pareto upper confidence bound (Pareto-UCB1 for short) can be used; they combine the KG policy and the UCB1 policy, respectively, with the Pareto dominance relation. In this paper, we propose Pareto Thompson sampling, which uses the Pareto dominance relation to find the Pareto front. We also propose an annealing-Pareto algorithm that trades off exploration and exploitation by using a decaying parameter epsilon(t) in combination with the Pareto dominance relation. The annealing-Pareto algorithm uses the decaying parameter to explore the Pareto optimal arms and uses the Pareto dominance relation to exploit the Pareto front. We experimentally compare Pareto-KG, Pareto-UCB1, Pareto Thompson sampling, and the annealing-Pareto algorithm on multi-objective Bernoulli distribution problems and conclude that annealing-Pareto performs best.
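A small sketch of Pareto Thompson sampling on a Bernoulli MOMAB is given below: each arm keeps one Beta posterior per objective, a mean vector is sampled for every arm, and one arm is pulled uniformly at random from the Pareto set of the sampled vectors. The arm means and horizon are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Bernoulli MOMAB: each arm has one success probability per objective.
# The arm means below are illustrative; arms 0-2 form the Pareto front.
TRUE_P = np.array([[0.9, 0.2],
                   [0.6, 0.6],
                   [0.2, 0.9],
                   [0.5, 0.5],
                   [0.3, 0.3]])
n_arms, n_obj = TRUE_P.shape

def pareto_set(vectors):
    """Indices whose reward vectors are not Pareto-dominated by any other."""
    idx = []
    for i, v in enumerate(vectors):
        dominated = any(np.all(w >= v) and np.any(w > v)
                        for j, w in enumerate(vectors) if j != i)
        if not dominated:
            idx.append(i)
    return idx

# Pareto Thompson sampling: sample a mean vector per arm from Beta posteriors,
# compute the Pareto set of the samples, and pull one of those arms uniformly.
alpha = np.ones((n_arms, n_obj))
beta = np.ones((n_arms, n_obj))
pulls = np.zeros(n_arms, dtype=int)
for t in range(5000):
    theta = rng.beta(alpha, beta)               # one posterior sample per arm and objective
    arm = rng.choice(pareto_set(theta))         # fair selection among the sampled optima
    reward = rng.random(n_obj) < TRUE_P[arm]    # vector-valued Bernoulli reward
    alpha[arm] += reward
    beta[arm] += 1 - reward
    pulls[arm] += 1

print("pulls per arm:", pulls)                  # mass should concentrate on arms 0-2
```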
In adaptive dynamic programming, neurocontrol, and reinforcement learning, the objective is for an agent to learn to choose actions so as to minimize a total cost function. In this paper, we show that when discretized time is used to model the motion of the agent, it can be very important to do clipping on the motion of the agent in the final time step of the trajectory. By clipping, we mean that the final time step of the trajectory is to be truncated such that the agent stops exactly at the first terminal state reached, and no distance further. We demonstrate that when clipping is omitted, learning performance can fail to reach the optimum, and when clipping is done properly, learning performance can improve significantly. The clipping problem we describe affects algorithms that use explicit derivatives of the model functions of the environment to calculate a learning gradient. These include backpropagation through time for control and methods based on dual heuristic programming. However, the clipping problem does not significantly affect methods based on heuristic dynamic programming, temporal differences learning, or policy-gradient learning algorithms.
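The effect described above can be seen in a one-dimensional toy problem: without clipping, the discretized time-to-goal cost is piecewise constant in the control parameter (so its model-based gradient is zero almost everywhere), whereas truncating the final step by the crossing fraction makes the cost, and hence the gradient used by BPTT or DHP, vary smoothly. The dynamics and numbers below are illustrative assumptions.

```python
import numpy as np

# 1-D toy problem: the agent starts at x = 0 and moves right with constant
# speed u (the learned parameter); the episode ends at the terminal boundary
# X_GOAL.  Cost is time-to-goal accumulated in discrete steps of size DT.
X_GOAL, DT = 1.0, 0.1

def rollout_cost(u, clip_final_step):
    x, cost = 0.0, 0.0
    for _ in range(1000):
        x_next = x + DT * u
        if x_next >= X_GOAL:
            if clip_final_step:
                # Clipping: truncate the last step so the agent stops exactly
                # at the terminal state, charging only the fraction lam of the
                # step cost.  The same lam would scale the final-step
                # derivatives in BPTT / DHP.
                lam = (X_GOAL - x) / (DT * u)
                cost += lam * DT
            else:
                cost += DT          # full step charged although the goal was overshot
            return cost
        x, cost = x_next, cost + DT
    return cost

# Without clipping the cost is piecewise constant in u, so gradient-based
# learning stalls; with clipping the cost varies smoothly with u.
for u in (0.95, 1.00, 1.05):
    unclipped, clipped = rollout_cost(u, False), rollout_cost(u, True)
    print(f"u = {u:.2f}  cost unclipped = {unclipped:.3f}  clipped = {clipped:.3f}")
```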
This paper is concerned with a new discrete-time policy iteration adaptive dynamic programming (ADP) method for solving the infinite horizon optimal control problem of nonlinear systems. The idea is to use an iterative ADP technique to obtain the iterative control law that optimizes the iterative performance index function. The main contribution of this paper is to analyze, for the first time, the convergence and stability properties of the policy iteration method for discrete-time nonlinear systems. It is shown that the iterative performance index function converges nonincreasingly to the optimal solution of the Hamilton-Jacobi-Bellman equation. It is also proven that any of the iterative control laws can stabilize the nonlinear system. Neural networks are used to approximate the performance index function and to compute the optimal control law, respectively, to facilitate the implementation of the iterative ADP algorithm, and the convergence of the weight matrices is analyzed. Finally, numerical results and analysis are presented to illustrate the performance of the developed method.
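A compact sketch of discrete-time policy iteration on a toy scalar nonlinear system is given below, with the performance index represented on a state grid with linear interpolation instead of the paper's neural networks; the system, grids, and tolerances are illustrative assumptions.

```python
import numpy as np

# Discrete-time policy iteration ADP on a scalar nonlinear system
# x_{k+1} = F(x, u) with utility U(x, u) = x^2 + u^2.
X = np.linspace(-2.0, 2.0, 81)             # state grid
U = np.linspace(-2.0, 2.0, 41)             # control grid

def F(x, u):
    return 0.8 * np.sin(x) + u             # toy nonlinear dynamics

def utility(x, u):
    return x ** 2 + u ** 2

def V_at(V, x):
    """Evaluate the gridded performance index by linear interpolation."""
    return np.interp(np.ravel(x), X, V).reshape(np.shape(x))

V = np.zeros_like(X)                        # iterative performance index function
policy = np.zeros_like(X)                   # initial admissible control law v_0 = 0

for i in range(20):
    # Policy evaluation: solve V_i(x) = U(x, v_i(x)) + V_i(F(x, v_i(x))).
    for _ in range(500):
        V_new = utility(X, policy) + V_at(V, F(X, policy))
        done = np.max(np.abs(V_new - V)) < 1e-8
        V = V_new
        if done:
            break
    # Policy improvement: v_{i+1}(x) = argmin_u { U(x, u) + V_i(F(x, u)) }.
    q = utility(X[:, None], U[None, :]) + V_at(V, F(X[:, None], U[None, :]))
    policy = U[np.argmin(q, axis=1)]

print("V(1.0) =", round(float(np.interp(1.0, X, V)), 4),
      " u(1.0) =", round(float(np.interp(1.0, X, policy)), 4))
```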
This paper is concerned with a new iterative theta-adaptive dynamic programming (ADP) technique to solve optimal control problems of infinite horizon discrete-time nonlinear systems. The idea is to use an iterative ADP algorithm to obtain the iterative control law that optimizes the iterative performance index function. In the present iterative theta-ADP algorithm, the requirement of an initial admissible control, as in policy iteration algorithms, is avoided. It is proved that all the iterative controls obtained in the iterative theta-ADP algorithm can stabilize the nonlinear system, which means that the algorithm is feasible for both online and offline implementation. A convergence analysis of the performance index function is presented to guarantee that the iterative performance index function converges monotonically to the optimum. Neural networks are used to approximate the performance index function and to compute the optimal control policy, respectively, to facilitate the implementation of the iterative theta-ADP algorithm. Finally, two simulation examples are given to illustrate the performance of the established method.
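The abstract does not spell out the theta-weighted update, so the sketch below only shows the general shape of such an iteration: a value-iteration-style ADP pass that needs no admissible initial control, followed by a closed-loop check that the intermediate control laws stabilize a toy system. It is not the paper's theta-ADP; the system, grids, and thresholds are illustrative assumptions.

```python
import numpy as np

# Value-iteration-style ADP on a scalar toy system, started from V_0 = 0 so
# that no admissible initial control law is required, with a closed-loop
# check that each intermediate control law stabilizes the system.
X = np.linspace(-2.0, 2.0, 81)
U = np.linspace(-1.5, 1.5, 31)

def F(x, u):
    return 0.9 * x + 0.1 * x ** 3 + u       # toy dynamics, open-loop unstable for |x| > 1

def utility(x, u):
    return x ** 2 + u ** 2

def V_at(V, x):
    return np.interp(np.ravel(x), X, V).reshape(np.shape(x))

def one_iteration(V):
    """V_{i+1}(x) = min_u { U(x,u) + V_i(F(x,u)) } and the associated control law."""
    q = utility(X[:, None], U[None, :]) + V_at(V, F(X[:, None], U[None, :]))
    return q.min(axis=1), U[q.argmin(axis=1)]

def is_stabilizing(policy, x0=1.5, steps=100):
    x = x0
    for _ in range(steps):
        x = F(x, float(np.interp(x, X, policy)))
    return abs(x) < 1e-2

V = np.zeros_like(X)
for i in range(1, 31):
    V, policy = one_iteration(V)
    if i % 10 == 0:
        print(f"iteration {i:2d}:  V(1.5) = {float(np.interp(1.5, X, V)):6.3f},  "
              f"stabilizing from x0 = 1.5: {is_stabilizing(policy)}")
```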
In this paper, a new iterative adaptive dynamic programming (ADP) algorithm is developed to solve optimal control problems for infinite horizon discrete-time nonlinear systems with finite approximation errors. First, a new generalized value iteration algorithm of ADP is developed to make the iterative performance index function converge to the solution of the Hamilton-Jacobi-Bellman equation. The generalized value iteration algorithm can be initialized with an arbitrary positive semi-definite function, which overcomes a disadvantage of traditional value iteration algorithms. When the iterative control law and the iterative performance index function cannot be obtained accurately in each iteration, a new "design method of the convergence criteria" for the finite-approximation-error-based generalized value iteration algorithm is established for the first time. A suitable approximation error can be designed adaptively to make the iterative performance index function converge to a finite neighborhood of the optimal performance index function. Neural networks are used to implement the iterative ADP algorithm. Finally, two simulation examples are given to illustrate the performance of the developed method.
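The two ingredients named above can be mimicked on a toy problem: the iteration is started from an arbitrary positive semi-definite function Psi, and every iterate is perturbed by a bounded error standing in for the neural-network approximation error, after which the result is compared against an error-free run. The system, Psi, and the error bound are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Generalized value iteration on a scalar toy system with a bounded
# per-iteration approximation error.
X = np.linspace(-2.0, 2.0, 81)
U = np.linspace(-1.5, 1.5, 31)
ERR_BOUND = 0.02                             # finite approximation error per iteration

def F(x, u):
    return 0.5 * x + u

def utility(x, u):
    return x ** 2 + u ** 2

def value_iteration_step(V, err_bound=0.0):
    q = utility(X[:, None], U[None, :]) + np.interp(
        F(X[:, None], U[None, :]).ravel(), X, V).reshape(len(X), len(U))
    return q.min(axis=1) + rng.uniform(-err_bound, err_bound, size=len(X))

# Error-free value iteration from V_0 = 0, used as a reference for V*.
V_exact = np.zeros_like(X)
for _ in range(200):
    V_exact = value_iteration_step(V_exact)

# Generalized value iteration from the arbitrary PSD start Psi(x) = 5 x^2,
# with a bounded approximation error injected at every iteration.
V = 5.0 * X ** 2
for _ in range(200):
    V = value_iteration_step(V, ERR_BOUND)

print("max |V - V*| after 200 inexact iterations:",
      round(float(np.max(np.abs(V - V_exact))), 3))
```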
A reinforcement learning based scheme for optimal switching with an infinite-horizon cost function is briefly proposed in this paper. Several theoretical questions arise regarding its convergence, the optimality of the result, and the continuity of the limit function, which is to be uniformly approximated using parametric function approximators. The main contribution of the paper is to provide rigorous answers to these questions, giving sufficient conditions for convergence, optimality, and continuity.
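As a rough illustration of RL for optimal switching with a parametric approximator, the sketch below runs fitted value iteration over two switching modes, refitting a polynomial value function at every step; the discounted cost, the modes, and the polynomial degree are illustrative assumptions and not the paper's formulation.

```python
import numpy as np

# Optimal switching sketch: at every step the controller picks which of two
# subsystems (modes) to apply; the value-function iterate is refit with a
# polynomial, standing in for the parametric function approximator.
X = np.linspace(-2.0, 2.0, 201)
GAMMA = 0.95                                  # discount keeping the toy cost finite

modes = [lambda x: 0.9 * x + 0.3,             # mode 0 drifts the state upward
         lambda x: 0.9 * x - 0.3]             # mode 1 drifts the state downward

def cost(x):
    return x ** 2

coeffs = np.zeros(7)                          # degree-6 polynomial approximator of V
for _ in range(100):
    targets = np.min([cost(X) + GAMMA * np.polyval(coeffs, np.clip(m(X), -2, 2))
                      for m in modes], axis=0)
    coeffs = np.polyfit(X, targets, deg=6)    # refit the approximator to the new iterate

switch = np.argmin([np.polyval(coeffs, np.clip(m(X), -2, 2)) for m in modes], axis=0)
for x0 in (-1.0, 0.0, 1.0):
    print(f"x = {x0:+.1f}: apply mode {switch[np.argmin(np.abs(X - x0))]}")
```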
In this paper, a novel data-driven stable iterative adaptive dynamic programming (ADP) algorithm is developed to solve optimal temperature control problems for water-gas shift (WGS) reaction systems. Using system data, neural networks (NNs) are employed to construct the dynamics of the WGS system and to solve the reference control, respectively, so that a mathematical model of the WGS system is unnecessary. Considering the reconstruction errors of the NNs and the disturbances of the system and control input, a new stable iterative ADP algorithm is developed to obtain the optimal control law. A convergence property is established to guarantee that the iterative performance index function converges to a finite neighborhood of the optimal performance index function. A stability property is established to guarantee that each of the iterative control laws can make the tracking error uniformly ultimately bounded (UUB). NNs are used to implement the stable iterative ADP algorithm. Finally, numerical results are given to illustrate the effectiveness of the developed method.
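A condensed sketch of the data-driven pipeline (identify the plant from data, derive the reference control for the setpoint, then iterate ADP on the tracking error) is given below, with an ordinary least-squares model standing in for the paper's NN identifier; the plant, the feature map, and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def plant(T, u):                              # "true" temperature response, unknown to the controller
    return 0.85 * T + 4.0 * u - 0.002 * T * u + 2.0 + rng.normal(0, 0.05, np.shape(T))

# 1) System identification from input/output data.
T_data = rng.uniform(150.0, 250.0, 2000)
u_data = rng.uniform(0.0, 10.0, 2000)
y_data = plant(T_data, u_data)
phi = np.column_stack([T_data, u_data, T_data * u_data, np.ones_like(T_data)])
w = np.linalg.lstsq(phi, y_data, rcond=None)[0]

def model(T, u):                              # identified dynamics f_hat(T, u)
    return w[0] * T + w[1] * u + w[2] * T * u + w[3]

# 2) Reference control: steady-state input that holds the setpoint T_ref.
T_ref = 200.0
u_grid = np.linspace(0.0, 10.0, 2001)
u_ref = u_grid[np.argmin(np.abs(model(T_ref, u_grid) - T_ref))]

# 3) Iterative ADP on the tracking error e = T - T_ref, control deviation du.
E = np.linspace(-50.0, 50.0, 101)
DU = np.linspace(-5.0, 5.0, 101)
V = np.zeros_like(E)
for _ in range(100):
    e_next = model(E[:, None] + T_ref, DU[None, :] + u_ref) - T_ref
    q = E[:, None] ** 2 + 10.0 * DU[None, :] ** 2 + np.interp(
        np.clip(e_next, E[0], E[-1]).ravel(), E, V).reshape(len(E), len(DU))
    V = q.min(axis=1)

du_law = DU[np.argmin(q, axis=1)]             # tracking-error feedback from the last iterate
print(f"u_ref = {u_ref:.2f};  feedback du at e = +20: {float(np.interp(20.0, E, du_law)):.2f}")
```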