In this study, we illustrate a real-time approximate dynamic programming (RTADP) method for solving multistage capacity decision problems in a stochastic manufacturing environment, using an exemplary three-stage manufacturing system with recycle. The system is a moderate-size queuing network that experiences stochastic variations in demand and product yield. The dynamic capacity decision problem is formulated as a Markov decision process (MDP). The proposed RTADP method starts with a set of heuristics and learns a superior-quality solution by interacting with the stochastic system via simulation. The curse of dimensionality associated with DP methods is alleviated by adopting several notions: an "evolving set of relevant states," for which the value function table is built and updated; an "adaptive action set" for keeping track of attractive action candidates; and a "nonparametric k nearest neighbor averager" for value function approximation. The performance of the learned solution is evaluated against (1) an "ideal" solution derived using a mixed integer programming (MIP) formulation, which assumes full knowledge of future realized values of the stochastic variables, (2) a myopic heuristic solution, and (3) a sample-path-based rolling horizon MIP solution. The policy learned through the RTADP method turned out to be superior to policies (2) and (3). (C) 2010 Wiley Periodicals, Inc. Naval Research Logistics 57: 211-224, 2010
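As a rough illustration of the nonparametric averager mentioned above, the sketch below estimates the value of an unvisited state by a k-nearest-neighbor average over a stored table of relevant states. All names, the Euclidean metric, and the unweighted average are assumptions; the abstract does not specify these details.

```python
import numpy as np

def knn_value(state, table_states, table_values, k=5):
    """Estimate the value of `state` by averaging the stored values of its
    k nearest neighbours in the evolving table of relevant states."""
    dists = np.linalg.norm(table_states - np.asarray(state), axis=1)
    nearest = np.argsort(dists)[:k]
    return table_values[nearest].mean()
```

Here `table_states` would be an (n, d) array of states already deemed relevant and `table_values` their learned values; a distance-weighted average would be an equally plausible reading.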
We consider the problem of temporal fair scheduling of queued data transmissions in wireless heterogeneous networks. We deal with both the throughput maximization problem and the delay minimization problem. Taking fairness constraints and the data arrival queues into consideration, we formulate the transmission scheduling problem as a Markov decision process (MDP) with fairness constraints. We study two categories of fairness constraints, namely temporal fairness and utilitarian fairness. We consider two criteria: infinite-horizon expected total discounted reward and expected average reward. Applying the dynamic programming approach, we derive and prove explicit optimality equations for the above constrained MDPs, and give corresponding optimal fair scheduling policies based on those equations. A practical stochastic-approximation-type algorithm is applied to calculate the policies' control parameters online. Furthermore, we develop a novel approximation method, temporal fair rollout, to achieve tractable computation. Numerical results show that the proposed scheme achieves significant performance improvement over existing schemes for both the throughput maximization and delay minimization problems.
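The abstract does not detail the temporal fair rollout, but a generic one-step rollout with a temporal-fairness guard might look like the following sketch; `simulate`, the share bookkeeping, and all parameter names are hypothetical.

```python
def temporal_fair_rollout(queues, share, served, t, simulate, n_sims=20):
    """Pick the user to serve by one-step rollout: score each feasible user
    by Monte Carlo simulation of a base heuristic after serving it, skipping
    users whose long-run temporal share of the channel is already met."""
    best_user, best_score = None, float("-inf")
    for u, backlog in enumerate(queues):
        if backlog == 0 or served[u] >= share[u] * (t + 1):  # fairness guard
            continue
        score = sum(simulate(queues, u) for _ in range(n_sims)) / n_sims
        if score > best_score:
            best_user, best_score = u, score
    return best_user
```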
In Part I of this tutorial, we provided a canonical modeling framework for sequential stochastic optimization (control) problems. A major feature of this framework is a clear separation between modeling a problem and designing policies to solve it. In Part II, we provide additional discussion of some of the more subtle concepts, such as the construction of a state variable. We illustrate the modeling process using an energy storage problem. We then create five variations of this problem designed to bring out the features of the different policies. The first four of these problems demonstrate that each of the four classes of policies is best for particular problem characteristics. The fifth problem is solved with a hybrid policy, illustrating the ability to combine the strengths of multiple policy classes.
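For reference, the canonical model from Part I can be summarized, in notation close to the tutorial's (details may differ), as choosing a policy to maximize expected cumulative contribution subject to the system transition:

$$\max_{\pi}\ \mathbb{E}\Big\{\sum_{t=0}^{T} C\big(S_t,\,X_t^{\pi}(S_t)\big)\Big\},\qquad S_{t+1}=S^{M}\big(S_t,\,x_t,\,W_{t+1}\big),$$

where $S_t$ is the state, $x_t = X_t^{\pi}(S_t)$ the decision produced by policy $\pi$, and $W_{t+1}$ the exogenous information arriving between decisions.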
This paper proposes a new approximate dynamic programming algorithm to solve the infinite-horizon optimal control problem for weakly coupled nonlinear systems. The algorithm is implemented as a three-critic/four-actor approximator structure, where the critic approximators learn the optimal costs and the actor approximators learn the optimal control policies. Both critic and actor approximators are adapted simultaneously in continuous time, a method commonly known as synchronous policy iteration. The adaptive-control nature of the algorithm requires a persistence-of-excitation condition to be guaranteed a priori, but this can be relaxed by using previously stored data concurrently with current data in the update of the critic approximators. Appropriate robustifying terms are added to the controllers to eliminate the effects of the residual errors, leading to asymptotic stability of the equilibrium point of the closed-loop system. Simulation results on a sixth-order dynamical example show the effectiveness of the proposed approach. Copyright (C) 2015 John Wiley & Sons, Ltd.
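A single critic update of the kind described, normalized gradient descent on a Bellman residual augmented with stored data in the spirit of concurrent learning, could be sketched as below; the residual form and all names are generic assumptions, not the paper's exact three-critic/four-actor equations.

```python
import numpy as np

def critic_step(Wc, sigma, cost, stored, alpha=1.0, dt=1e-3):
    """Euler step of a normalized-gradient critic update on the Bellman
    residual e = Wc @ sigma + cost; stored (sigma_j, cost_j) samples are
    replayed so persistence of excitation need not hold along the trajectory."""
    def grad(s, c):
        e = Wc @ s + c                       # Bellman residual for this sample
        return s * e / (1.0 + s @ s) ** 2    # normalized gradient direction
    g = grad(sigma, cost) + sum(grad(s, c) for s, c in stored)
    return Wc - alpha * dt * g
```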
This paper is concerned with a new data-driven zero-sum neuro-optimal control problem for continuous-time unknown nonlinear systems with disturbance. Based on the input-output data of the nonlinear system, an effective recurrent neural network is introduced to reconstruct the dynamics of the nonlinear system. Treating the system disturbance as a control input, a two-player zero-sum optimal control problem is established. Adaptive dynamic programming (ADP) is developed to obtain the optimal control under the worst case of the disturbance. Three single-layer neural networks, one critic and two action networks, are employed to approximate the performance index function, the optimal control law, and the disturbance, respectively, facilitating the implementation of the ADP method. Convergence properties of the ADP method are developed to show that the system state converges to a finite neighborhood of the equilibrium. The weight matrices of the critic and the two action networks also converge to finite neighborhoods of their optimal values. Finally, simulation results show the effectiveness of the developed data-driven ADP method.
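The underlying two-player zero-sum formulation is the standard H-infinity-style differential game, written here in a generic form that the paper's specific performance index may refine:

$$V^{*}(x_0)=\min_{u}\,\max_{w}\int_{0}^{\infty}\big(Q(x)+u^{\top}R\,u-\gamma^{2}\,w^{\top}w\big)\,dt,$$

with the critic network approximating $V^{*}$ and the two action networks approximating the saddle-point policies $u^{*}$ (control) and $w^{*}$ (worst-case disturbance).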
In this paper, a value iteration adaptive dynamic programming (ADP) algorithm is developed to solve infinite-horizon undiscounted optimal control problems for discrete-time nonlinear systems. The present value iteration ADP algorithm permits an arbitrary positive semi-definite function to initialize the algorithm. A novel convergence analysis is developed to guarantee that the iterative value function converges to the optimal performance index function. It is proven that, depending on the initial function, the iterative value function will be monotonically nonincreasing, monotonically nondecreasing, or nonmonotonic, and will converge to the optimum. For the first time, the admissibility properties of the iterative control laws are developed for value iteration algorithms, and new termination criteria are established to guarantee the effectiveness of the iterative control laws. Neural networks are used to approximate the iterative value function and to compute the iterative control law, facilitating the implementation of the iterative ADP algorithm. Finally, two simulation examples are given to illustrate the performance of the present method.
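On a discretized state grid, the value-iteration recursion the paper analyzes, $V_{i+1}(x)=\min_u\{U(x,u)+V_i(F(x,u))\}$ with an arbitrary positive semi-definite $V_0$, can be sketched as follows. The grid projection and all names are illustrative; the paper itself uses neural network approximators rather than a grid.

```python
import numpy as np

def value_iteration(F, U, states, actions, V0, n_iter=100):
    """V_{i+1}(x) = min_u [ U(x,u) + V_i(F(x,u)) ], started from an
    arbitrary positive semi-definite function V0 as the paper permits."""
    states = [np.asarray(s, dtype=float) for s in states]
    def nearest(x):  # project the successor state onto the grid
        return min(range(len(states)), key=lambda i: np.linalg.norm(x - states[i]))
    V = np.array([V0(s) for s in states], dtype=float)
    for _ in range(n_iter):
        V = np.array([min(U(s, a) + V[nearest(np.asarray(F(s, a)))] for a in actions)
                      for s in states])
    return V
```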
This research introduces the use of approximate dynamic programming to overcome a variety of limitations of distinct infrastructure management problem formulations. Both the form and the parameters of a model specifying the long-term costs associated with alternative infrastructure maintenance policies are learned via simulation. The introduced methodology makes it possible to manage large heterogeneous networks of facilities related by budgetary restrictions and resource constraints, as well as by dependencies in maintenance costs or deterioration. In addition, the methodology is particularly well suited to simultaneous consideration of multiple types of infrastructure condition data, including continuous-valued data and relevant historical data. The introduced techniques will prove valuable when high-quality deterioration and cost estimation models are available but are ill suited for use in a Markov decision problem framework. Computational studies show that the introduced approach finds an optimal solution to a relatively simple infrastructure management problem and finds increasingly good solutions to a more complex problem.
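One way to realize "learning the form and parameters of a long-term cost model via simulation" is regression of simulated discounted costs onto condition features, which accommodates continuous-valued and historical data directly. The sketch below is a minimal linear version with hypothetical names; the paper's actual model class is not specified in the abstract.

```python
import numpy as np

def fit_cost_to_go(simulate_cost, features, sample_states, n_rollouts=30):
    """Fit a linear cost-to-go model by regressing Monte Carlo estimates of
    long-term maintenance cost on facility-condition features."""
    X = np.array([features(s) for s in sample_states])
    y = np.array([np.mean([simulate_cost(s) for _ in range(n_rollouts)])
                  for s in sample_states])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta  # predicted cost-to-go of state s: features(s) @ theta
```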
The main focus of this article is a proposal to solve, via UDU^T factorisation, the convergence and numerical stability problems caused by ill-conditioning of the covariance matrix in the recursive least squares (RLS) approach for online approximation of the algebraic Riccati equation (ARE) solution associated with the discrete linear quadratic regulator (DLQR) problem, formulated in the actor-critic reinforcement learning and approximate dynamic programming context. The parameterisations of the Bellman equation, utility function, and dynamic system, together with the algebra of the Kronecker product, assemble a framework for the solution of the DLQR problem. The condition number and the positivity parameter of the covariance matrix are associated with statistical metrics for evaluating the approximation performance of the ARE solution via RLS-based estimators. The performance of the RLS approximators is also evaluated in terms of consistency and bias when associated with reinforcement learning methods. The methodology contemplates realisations of online designs for DLQR controllers that are evaluated on a multivariable dynamic system model.
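For orientation, a plain (unfactorised) RLS step for the Bellman-equation parameters is sketched below; the covariance update of `P` is exactly where the ill-conditioning the article targets arises, and is what the UDU^T-factorised variant is designed to avoid. Names and the forgetting factor are illustrative.

```python
import numpy as np

def rls_step(theta, P, phi, target, lam=1.0):
    """One recursive least squares update of parameter estimate `theta`
    with regressor `phi`, covariance `P`, and forgetting factor `lam`."""
    Pphi = P @ phi
    k = Pphi / (lam + phi @ Pphi)               # gain vector
    theta = theta + k * (target - phi @ theta)  # innovation correction
    P = (P - np.outer(k, Pphi)) / lam           # covariance update; this form
    return theta, P                             # can lose positive definiteness
```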
We investigate a class of scheduling problems in which dynamically and stochastically arriving appointment requests are either rejected or booked for future slots. A customer may cancel an appointment, and a customer who does not cancel may fail to show up. The planner may overbook appointments to mitigate the detrimental effects of cancellations and no-shows. Each customer needs multiple renewable resources. The system receives a reward for providing service and incurs costs for rejecting requests, appointment delays, and overtime. Customers are heterogeneous in all problem parameters. We provide a Markov decision process (MDP) formulation of these problems. Exact solution of this MDP is intractable. We show that this MDP has a weakly coupled structure that enables us to apply an approximate dynamic programming method rooted in Lagrangian relaxation, affine value function approximation, and constraint generation. We compare this method with a myopic scheduling heuristic on eighteen hundred problem instances. Our experiments show a statistically significant difference in the performance of the two methods in 77% of these instances; of these, the Lagrangian method outperforms the myopic method in 97%. (C) 2015 Elsevier Ltd. All rights reserved.
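In generic form (the paper's exact formulation may differ), weak coupling means the linking resource constraints can be relaxed with multipliers $\lambda \ge 0$, yielding a separable bound that is then fitted with an affine value function approximation:

$$V(s)\;\le\;\lambda^{\top} b\;+\;\sum_{i} V_i^{\lambda}(s_i),\qquad V_i^{\lambda}(s_i)\;\approx\;\theta_{i,0}+\theta_i^{\top} s_i,$$

where $b$ is the resource capacity vector, the $s_i$ are per-customer state components, and constraint generation keeps the resulting linear program over $(\lambda, \theta)$ tractable.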
Inventory management of a procurement system is decomposed into sub-problems according to the timescale of the decisions: long-term planning for ordering raw materials and short-term scheduling for unloading the orders. To ensure more sustainable and robust operation, the different decision layers should be integrated (the essence of a multi-scale approach), and supply and demand uncertainty should be considered. In this study, the planning problem is formulated as a Markov decision process (MDP) to incorporate possible realizations of uncertainty into the decision-making process. The MDP planning model is integrated with a scheduling model expressed as an MILP (or closely approximated by a heuristic approach). Decision policies are obtained by solving the MDP through exact value iteration, as well as through an approximate approach intended to alleviate the computational challenges. On benchmark problems, we compare the results of applying these policies with those of a reference policy obtained without any rigorous integration with scheduling. (C) 2016 Elsevier Ltd. All rights reserved.
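The integration described, value iteration for the ordering MDP with each stage cost supplied by the scheduling layer, can be sketched as follows. Here `sched_cost` stands in for the MILP (or heuristic) scheduling evaluation, a discount factor is assumed for convergence of the sketch, and all names are illustrative rather than taken from the paper.

```python
def integrated_value_iteration(states, order_actions, transition, sched_cost,
                               gamma=0.95, n_iter=200):
    """Exact value iteration for the raw-material ordering MDP, where each
    stage cost embeds the short-term cost of scheduling the order unloading."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iter):
        V = {s: min(sched_cost(s, a) +
                    gamma * sum(p * V[s2] for s2, p in transition(s, a))
                    for a in order_actions)
             for s in states}
    return V
```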