This paper deals with the finite-horizon optimal tracking control for a class of discrete-time nonlinear systems using the iterative adaptive dynamic programming (ADP) algorithm. First, the optimal tracking problem is converted into designing a finite-horizon optimal regulator for the tracking error dynamics. Then, with convergence analysis in terms of the cost function and control law, the iterative ADP algorithm via the heuristic dynamic programming (HDP) technique is introduced to obtain the finite-horizon optimal tracking controller, which makes the cost function close to its optimal value within an ε-error bound. Furthermore, three neural networks are used to implement the algorithm, approximating the cost function, the control law, and the error dynamics, respectively. At last, an example is included to demonstrate the effectiveness of the proposed approach.
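As a rough illustration of the iteration this abstract describes, the sketch below runs an HDP-style value update for a tracking-error regulator, assuming hypothetical scalar error dynamics, a quadratic utility, and coarse grids in place of the paper's three neural networks; the stopping test stands in for the ε-error bound.

```python
import numpy as np

# Minimal HDP-style iteration for a finite-horizon tracking-error regulator.
# Assumptions (not from the paper): scalar error dynamics e' = f(e, u),
# quadratic utility, and crude grids in place of the three neural networks.
def f(e, u):                       # hypothetical tracking-error dynamics
    return 0.8 * np.sin(e) + u

def utility(e, u):                 # stage cost U(e, u)
    return e**2 + 0.5 * u**2

e_grid = np.linspace(-2.0, 2.0, 201)
u_grid = np.linspace(-1.0, 1.0, 101)

V = np.zeros_like(e_grid)          # V_0(e) = 0 (initial iterate)
for i in range(50):                # iterate until the cost sequence settles
    V_new = np.empty_like(V)
    policy = np.empty_like(V)
    for k, e in enumerate(e_grid):
        e_next = f(e, u_grid)                      # candidate successor errors
        q = utility(e, u_grid) + np.interp(e_next, e_grid, V)
        j = np.argmin(q)
        V_new[k], policy[k] = q[j], u_grid[j]      # greedy control law
    if np.max(np.abs(V_new - V)) < 1e-3:           # epsilon-style stopping test
        V = V_new
        break
    V = V_new
```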
This paper studies Merton's portfolio optimization problem with proportional transaction costs in a discrete-time finite horizon. Facing short-sale and borrowing constraints, investors have access to a risk-free asset and multiple risky assets whose returns follow a multivariate geometric Brownian motion. Lower and upper bounds for optimal solutions are computed for problems with up to 20 risky assets and 40 investment periods. Three lower bounds are proposed: the value function optimization (VF), the hyper-sphere and the hyper-cube policy parameterizations (HS and HC). VF addresses the difficulties of traditional value function iteration for high-dimensional dynamic programs with continuous decision and state spaces. HS and HC respectively approximate the geometry of the trading policy in the high-dimensional state space by two surfaces. To evaluate the lower bounds, two new upper bounds are provided via a duality method based on a new auxiliary problem (OMG and OMG2). Compared with existing methods across various parameter suites, the new methods show clear advantages: the three lower bound methods always achieve higher utilities, HS and HC cut run times by a factor of 100, and OMG and OMG2 mostly provide tighter upper bounds. In addition, the paper investigates how the no-trading region characterizing the optimal policy deforms when the short-sale and borrowing constraints bind.
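The hyper-cube (HC) parameterization suggests a box-shaped no-trading region. The snippet below is only a toy illustration of that geometric idea with hypothetical bounds, not the paper's bounding or duality machinery.

```python
import numpy as np

# Illustrative sketch (not the paper's HC method): a box-shaped no-trading
# region in the space of risky-asset weights. If the current weights fall
# inside the box, do nothing; otherwise trade each asset just back to the
# nearest face, which keeps transaction costs small. Bounds are hypothetical.
def hypercube_policy(weights, lower, upper):
    """Return post-trade weights under a per-asset no-trading interval."""
    weights = np.asarray(weights, dtype=float)
    return np.clip(weights, lower, upper)

# Example: 3 risky assets, target weight 0.2 each, half-width 0.05.
lower = np.full(3, 0.15)
upper = np.full(3, 0.25)
print(hypercube_policy([0.10, 0.22, 0.30], lower, upper))  # -> [0.15 0.22 0.25]
```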
Value function approximation has a central role in approximate dynamic programming (ADP) to overcome the so-called curse of dimensionality associated with real stochastic processes. In this regard, we propose a novel Least-Squares Temporal Difference (LSTD) based method: the "Multi-trajectory Greedy LSTD" (MG-LSTD). It is an exploration-enhanced recursive LSTD algorithm with the policy improvement embedded within the LSTD iterations. It makes use of multi-trajectory Monte Carlo simulations in order to enhance exploration of the system state space. This method is applied to solving resource allocation problems modeled via a constrained stochastic dynamic programming (SDP) based framework. In particular, such problems are formulated as a set of parallel Birth-Death Processes (BDPs). Some operational scenarios are defined and solved to show the effectiveness of the proposed approach. Finally, we provide some experimental evidence on the MG-LSTD algorithm's convergence properties as a function of its key parameters. (C) 2020 Elsevier Ltd. All rights reserved.
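For readers unfamiliar with recursive LSTD, the following sketch shows a Sherman-Morrison rank-one update of the LSTD quantities for one transition, with placeholder features and rewards; the multi-trajectory exploration and the embedded greedy improvement of MG-LSTD are only indicated in comments.

```python
import numpy as np

# Minimal sketch of a recursive LSTD(0) update with a greedy improvement step,
# in the spirit of the MG-LSTD idea described above. The features, rewards,
# and simulator below are hypothetical placeholders, not the paper's model.
def lstd_recursive_update(A_inv, b, phi_s, phi_next, reward, gamma=0.95):
    """Sherman-Morrison rank-1 update of A^{-1} and b for one transition."""
    u = phi_s
    v = phi_s - gamma * phi_next
    Au = A_inv @ u
    A_inv = A_inv - np.outer(Au, v @ A_inv) / (1.0 + v @ Au)
    b = b + reward * phi_s
    return A_inv, b

d = 8                                   # feature dimension (assumed)
A_inv = np.eye(d) / 1e-3                # (delta * I)^-1 initialization
b = np.zeros(d)

# One simulated trajectory with random placeholder features/rewards:
rng = np.random.default_rng(0)
for _ in range(100):
    phi_s, phi_next = rng.normal(size=d), rng.normal(size=d)
    A_inv, b = lstd_recursive_update(A_inv, b, phi_s, phi_next, rng.normal())

theta = A_inv @ b                       # value-function weights for the current
                                        # policy; MG-LSTD would apply a greedy
                                        # policy improvement at this point and
                                        # launch further exploratory trajectories.
```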
The accessibility and efficiency of outpatient clinic operations are largely affected by appointment schedules. Clinical scheduling is a process of assigning physician appointment times to sequentially calling patient...
Approximate dynamic programming (ADP) aims to obtain an approximate numerical solution to the discrete-time Hamilton-Jacobi-Bellman (HJB) equation. Heuristic dynamic programming (HDP) is a two-stage iterative scheme of ADP that separates the HJB equation into two equations, one for the value function and another for the policy function, which are referred to as the critic and the actor, respectively. Previous ADP implementations have been limited by the choice of function approximator, which requires significant prior domain knowledge or a large number of parameters to be fitted. However, recent advances in deep learning brought by the computer science community enable the use of deep neural networks (DNNs) to approximate high-dimensional nonlinear functions without prior domain knowledge. Motivated by this, we examine the potential of DNNs as function approximators of the critic and the actor. In contrast to the infinite-horizon optimal control problem, the critic and the actor of the finite-horizon optimal control (FHOC) problem are time-varying functions and have to satisfy a boundary condition. A DNN structure and a training algorithm suitable for FHOC are presented. Illustrative examples are provided to demonstrate the validity of the proposed method.
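A minimal sketch of the time-varying critic/actor idea is given below, assuming PyTorch, a hypothetical known model and quadratic stage cost, and the time index fed to both networks as an extra input; it shows a single critic update of the two-stage scheme, not the paper's full training algorithm or boundary-condition handling.

```python
import torch
import torch.nn as nn

# Sketch (assumptions, not the paper's exact architecture): time-varying
# critic V(x, t) and actor u(x, t) for a finite-horizon problem, implemented
# as small DNNs that take the time index as an extra input. The terminal
# condition V(x, T) = phi(x) would be imposed by training the critic on it.
class MLP(nn.Module):
    def __init__(self, n_in, n_out, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, n_out),
        )
    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

n_x, n_u, T = 2, 1, 10
critic = MLP(n_x + 1, 1)            # V(x, t)
actor = MLP(n_x + 1, n_u)           # u(x, t)

def stage_cost(x, u):               # hypothetical quadratic stage cost
    return (x**2).sum(-1, keepdim=True) + 0.1 * (u**2).sum(-1, keepdim=True)

def dynamics(x, u):                 # hypothetical known model x_{k+1} = f(x, u)
    return 0.9 * x + torch.cat([u, torch.zeros_like(u)], dim=-1)

# One critic-update step of the two-stage (HDP-style) scheme:
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
x = torch.randn(256, n_x)
t = torch.randint(0, T, (256, 1)).float()
u = actor(x, t)
target = stage_cost(x, u) + critic(dynamics(x, u), t + 1)   # Bellman target
loss = ((critic(x, t) - target.detach())**2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```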
This paper is concerned with a novel generalized policy iteration (GPI) algorithm with approximation errors. Approximation errors are explicitly considered in the GPI algorithm, and the properties of the stable GPI algorithm with approximation errors are analyzed. The convergence of the developed algorithm is established, showing that the iterative value function converges to a finite neighborhood of the optimal performance index function. Finally, numerical examples and comparisons are presented.
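The following tabular sketch illustrates the flavour of GPI with approximation errors: each iteration performs a greedy improvement followed by a few evaluation sweeps perturbed by a bounded error term; the random MDP, the error bound, and the sweep count are assumptions for illustration only.

```python
import numpy as np

# Illustrative tabular sketch of generalized policy iteration (GPI) with an
# explicit per-step approximation error, loosely mirroring the setting above.
# The MDP (P, R), the error bound eps, and the sweep count N are assumptions.
rng = np.random.default_rng(1)
nS, nA, gamma, eps, N = 6, 3, 0.9, 0.01, 4
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # P[s, a] = next-state dist.
R = rng.normal(size=(nS, nA))

V = np.zeros(nS)
for _ in range(100):
    # Policy improvement (greedy w.r.t. the current value function).
    Q = R + gamma * P @ V
    pi = Q.argmax(axis=1)
    # N evaluation sweeps of the greedy policy, each corrupted by a bounded
    # error term standing in for function-approximation error.
    for _ in range(N):
        V = R[np.arange(nS), pi] + gamma * (P[np.arange(nS), pi] @ V)
        V += rng.uniform(-eps, eps, size=nS)
```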
ISBN (print): 9781509001644
In this paper, we propose an online learning method for adaptive traffic signal control in a multi-intersection system. The method uses approximate dynamic programming (ADP) to achieve a near-optimal solution of the signal optimization in a distributed network, which is modeled in a microscopic way. The traffic network loading model and the traffic signal control model are presented to serve as the basis of the discrete-time control environment. The learning process of the linear function approximation in the ADP approach adopts tunable parameters over the traffic states, including the vehicle queue length and the signal indication. ADP overcomes the computational complexity that usually appears in large-scale problems solved by exact algorithms such as dynamic programming. Moreover, the proposed adaptive phase sequence (APS) mode improves performance compared with other control methods. The simulation results show that our method performs well on the adaptive traffic signal control problem.
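As a sketch of the linear function approximation described above, the snippet below learns weights over hypothetical features (per-approach queue lengths plus a one-hot signal indication) with a temporal-difference update on a toy intersection model; it is not the paper's traffic network loading model or APS logic.

```python
import numpy as np

# Sketch of the linear value-function approximation idea: V(s) ~ theta . phi(s),
# where phi(s) stacks per-approach queue lengths and a one-hot signal indication.
# The intersection model and step cost (total queue length) are placeholders.
def features(queues, phase, n_phases):
    one_hot = np.zeros(n_phases)
    one_hot[phase] = 1.0
    return np.concatenate([queues, one_hot])

def td_update(theta, phi, phi_next, cost, alpha=0.01, gamma=0.95):
    """One temporal-difference step toward the one-step lookahead target."""
    td_error = cost + gamma * theta @ phi_next - theta @ phi
    return theta + alpha * td_error * phi

n_approaches, n_phases = 4, 2
theta = np.zeros(n_approaches + n_phases)

rng = np.random.default_rng(2)
queues, phase = rng.integers(0, 10, n_approaches).astype(float), 0
for _ in range(200):
    # Choose the phase that looks best one step ahead (greedy w.r.t. theta).
    # A real controller would also respect minimum-green and phase-sequence rules.
    next_phase = min(range(n_phases),
                     key=lambda p: theta @ features(queues, p, n_phases))
    served = queues * (0.5 if next_phase == 0 else 0.3)        # toy service rates
    next_queues = np.maximum(queues - served + rng.poisson(1.0, n_approaches), 0)
    cost = next_queues.sum()                                   # total queue length
    theta = td_update(theta,
                      features(queues, phase, n_phases),
                      features(next_queues, next_phase, n_phases),
                      cost)
    queues, phase = next_queues, next_phase
```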
Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximate form, which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of the well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide an error propagation analysis that unifies those of approximate policy and value iteration. We develop the finite-sample analysis of these algorithms, which highlights the influence of their parameters. In the classification-based version of the algorithm (CBMPI), the analysis shows that MPI's main parameter controls the balance between the estimation error of the classifier and the overall value function approximation. We illustrate and evaluate the behavior of these new algorithms on the Mountain Car and Tetris problems. Remarkably, in Tetris, CBMPI outperforms the existing DP approaches by a large margin and competes with the current state-of-the-art methods while using fewer samples.
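The fitted-Q flavour of AMPI can be sketched as follows on a small random MDP: take the greedy policy with respect to the current Q, apply m Bellman backups under it, and refit a linear Q-model by least squares. The MDP, the features, and m are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Sketch of a fitted-Q style approximate modified policy iteration: greedy
# improvement, m partial evaluation backups (m=1 -> value iteration,
# m -> infinity -> policy iteration), then a least-squares "fit" step.
rng = np.random.default_rng(3)
nS, nA, gamma, m = 20, 4, 0.9, 3
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.normal(size=(nS, nA))
Phi = rng.normal(size=(nS * nA, 8))              # features over (s, a) pairs

w = np.zeros(8)
for _ in range(50):
    Q = (Phi @ w).reshape(nS, nA)
    pi = Q.argmax(axis=1)                        # greedy policy (improvement)
    # m evaluation backups under pi, done exactly here for brevity;
    # AMPI would estimate these from sampled rollouts instead.
    for _ in range(m):
        V = Q[np.arange(nS), pi]
        Q = R + gamma * P @ V
    # "Fitted" step: project the backed-up Q onto the linear feature space.
    w, *_ = np.linalg.lstsq(Phi, Q.reshape(-1), rcond=None)
```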
In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in the policy update. This allows us to prove finite-iteration and asymptotic ℓ∞-norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error, as opposed to the standard bounds, which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with a limited number of samples per iteration, for which the average of the errors can be significantly smaller than their supremum. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we numerically illustrate the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods.
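A tabular sketch of the incremental-preference idea behind DPP is given below, following one common presentation in which the policy is a softmax of action preferences and each sweep adds an advantage-like increment; the random MDP and the temperature η are assumptions, and the exact operator in the paper may differ in its details.

```python
import numpy as np

# Tabular sketch of a dynamic-policy-programming-style recursion on action
# preferences Psi: the policy is a softmax of Psi, and each sweep adds an
# advantage-like increment, so the policy changes gradually between sweeps.
rng = np.random.default_rng(4)
nS, nA, gamma, eta = 10, 3, 0.9, 5.0
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.normal(size=(nS, nA))

def softmax_avg(Psi, eta):
    """Boltzmann-weighted average of preferences per state."""
    w = np.exp(eta * (Psi - Psi.max(axis=1, keepdims=True)))
    w /= w.sum(axis=1, keepdims=True)
    return (w * Psi).sum(axis=1)

Psi = np.zeros((nS, nA))
for _ in range(200):
    m = softmax_avg(Psi, eta)                       # one aggregated value per state
    Psi = Psi + R + gamma * (P @ m) - m[:, None]    # incremental preference update

policy = np.exp(eta * (Psi - Psi.max(axis=1, keepdims=True)))
policy /= policy.sum(axis=1, keepdims=True)         # softmax policy from Psi
```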