Search Results - Inner Mongolia University Library

A model-based deep reinforcement learning method applied to finite-horizon optimal control of nonlinear control-affine system

JOURNAL OF PROCESS CONTROL, 2020, vol. 87, pp. 166-178

Authors: Kim, Jong Woo; Park, Byung Jun; Yoo, Haeun; Oh, Tae Hoon; Lee, Jay H.; Lee, Jong Min
Affiliations: Seoul Natl Univ, Sch Chem & Biol Engn, Inst Chem Proc, 1 Gwanak Ro, Seoul 08826, South Korea; Korea Adv Inst Sci & Technol, Dept Chem & Biomol Engn, 291 Daehak Ro, Daejeon 34141, South Korea

The Hamilton-Jacobi-Bellman (HJB) equation can be solved to obtain optimal closed-loop control policies for general nonlinear systems. As it is seldom possible to solve the HJB equation exactly for nonlinear systems, either analytically or numerically, methods to build approximate solutions through simulation-based learning have been studied under various names, such as neurodynamic programming (NDP) and approximate dynamic programming (ADP). The learning aspect connects these methods to reinforcement learning (RL), which also tries to learn optimal decision policies through trial-and-error learning. This study develops a model-based RL method, which iteratively learns the solution to the HJB and its associated equations. We focus particularly on the control-affine system with a quadratic objective function and the finite horizon optimal control (FHOC) problem with time-varying reference trajectories. The HJB solutions for such systems involve time-varying value, costate, and policy functions subject to boundary conditions. To represent the time-varying HJB solution in high-dimensional state space in a general and efficient way, deep neural networks (DNNs) are employed. It is shown that the use of DNNs, compared to shallow neural networks (SNNs), can significantly improve the performance of a learned policy in the presence of uncertain initial state and state noise. Examples involving a batch chemical reactor and a one-dimensional diffusion-convection-reaction system are used to demonstrate this and other key aspects of the method. (C) 2020 Elsevier Ltd. All rights reserved.

Keywords: Reinforcement learning; approximate dynamic programming; Deep neural networks; Globalized dual heuristic programming; Finite horizon optimal control problem; Hamilton-Jacobi-Bellman equation

A dynamic mobile production capacity and inventory control problem

IISE TRANSACTIONS, 2020, vol. 52, no. 8, pp. 926-943

Authors: Malladi, Satya S.; Erera, Alan L.; White, Chelsea C., III
Affiliations: Georgia Inst Technol, Atlanta, GA 30332, USA; Tech Univ Denmark, Lyngby, Denmark

We analyze a problem of dynamic logistics planning given uncertain demands for a multi-location production-inventory system with transportable modular production capacity. In such systems, production modules provide capacity, and can be moved from one location to another to produce stock and satisfy demand. We formulate a dynamic programming model for a planning problem that considers production and inventory decisions, and develop suboptimal lookahead and rollout policies that use value function approximations based on geographic decomposition. Mixed-integer programming formulations are provided for several single-period optimization problems that define these policies. These models generalize a formulation for the single-period newsvendor problem, and in some cases the feasible region polyhedra contain only integer extreme points allowing efficient solution computation. A computational study with stationary demand distributions, which should benefit least from mobile capacity, provides an analysis of the effectiveness of these policies and the value that mobile production capacity provides. For problems with 20 production locations, the best suboptimal policies produce on average 13% savings over fixed capacity allocation policies when the costs of module movement, holding, and backordering are accounted for. Greater savings result when the number of locations increases.

Keywords: Mobile modular production; joint inventory control and capacity logistics; approximate dynamic programming; rollout heuristic
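
The abstract above notes that its single-period models generalize the newsvendor problem. For reference only, here is a minimal sketch of the classic single-location newsvendor with discrete demand, solved with the critical-ratio rule. The cost values, the demand distribution, and the function names `newsvendor_quantity` and `expected_cost` are illustrative assumptions and do not come from the paper.

```python
from itertools import accumulate


def newsvendor_quantity(demand_pmf, underage_cost, overage_cost):
    """Return the smallest q with P(D <= q) >= cu / (cu + co)."""
    critical_ratio = underage_cost / (underage_cost + overage_cost)
    demands = sorted(demand_pmf)
    cdf = accumulate(demand_pmf[d] for d in demands)
    for d, p in zip(demands, cdf):
        # Small slack guards against floating-point rounding in the CDF.
        if p >= critical_ratio - 1e-12:
            return d
    return demands[-1]


def expected_cost(q, demand_pmf, underage_cost, overage_cost):
    """Expected shortage cost plus expected leftover cost for order quantity q."""
    return sum(
        p * (underage_cost * max(d - q, 0) + overage_cost * max(q - d, 0))
        for d, p in demand_pmf.items()
    )


# Hypothetical data: demand in units, costs per unit short / per unit left over.
pmf = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.2, 4: 0.1}
q_star = newsvendor_quantity(pmf, underage_cost=3.0, overage_cost=1.0)
```

With these assumed numbers the critical ratio is 0.75, so the rule picks the smallest quantity whose cumulative demand probability reaches that level.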

An Approximation Algorithm for Network Revenue Management Under Nonstationary Arrivals

OPERATIONS RESEARCH, 2020, vol. 68, no. 3, pp. 834-855

Authors: Ma, Yuhang; Rusmevichientong, Paat; Sumida, Mika; Topaloglu, Huseyin
Affiliations: Cornell Tech, Sch Operat Res & Informat Engn, New York, NY 10044, USA; Univ Southern Calif, Marshall Sch Business, Los Angeles, CA 90089, USA

We provide an approximation algorithm for network revenue management problems. In our approximation algorithm, we construct an approximate policy using value function approximations that are expressed as linear combinations of basis functions. We use a backward recursion to compute the coefficients of the basis functions in the linear combinations. If each product uses at most L resources, then the total expected revenue obtained by our approximate policy is at least 1/(1 + L) of the optimal total expected revenue. In many network revenue management settings, although the number of resources and products can become large, the number of resources used by a product remains bounded. In this case, our approximate policy provides a constant-factor performance guarantee. Our approximate policy can handle nonstationarities in the customer arrival process. To our knowledge, our approximate policy is the first approximation algorithm for network revenue management problems under nonstationary arrivals. Our approach can incorporate the customer choice behavior among the products, and allows the products to use multiple units of a resource, while still maintaining the performance guarantee. In our computational experiments, we demonstrate that our approximate policy performs quite well, providing total expected revenues that are substantially better than its theoretical performance guarantee.

Keywords: network revenue management; approximate dynamic programming; dynamic pricing
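
To make the 1/(1 + L) guarantee stated in the abstract concrete, the sketch below computes L for a small product-to-resource map and the resulting worst-case fraction of the optimal expected revenue. The three-itinerary network and the function name `performance_guarantee` are hypothetical, and the sketch only evaluates the bound; it does not implement the paper's approximation algorithm.

```python
from fractions import Fraction


def performance_guarantee(product_resources):
    """Return (L, 1/(1+L)), where L is the largest number of distinct
    resources used by any single product."""
    L = max(len(set(resources)) for resources in product_resources.values())
    return L, Fraction(1, 1 + L)


# Hypothetical airline-style network: products are itineraries, resources are flight legs.
products = {
    "A-B": ["leg1"],
    "B-C": ["leg2"],
    "A-C via B": ["leg1", "leg2"],
}
L, bound = performance_guarantee(products)
```

Here the connecting itinerary uses two legs, so L = 2 and the guaranteed fraction is 1/3, however many products and resources the network contains.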

Meso-parametric value function approximation for dynamic customer acceptances in delivery routing

EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 2020, vol. 285, no. 1, pp. 183-195

Authors: Ulmer, Marlin W.; Thomas, Barrett W.
Affiliations: Tech Univ Carolo Wilhelmina Braunschweig, Carl Friedrich Gauss Fak, Muhlenpfordtstr 23, D-38106 Braunschweig, Germany; Univ Iowa, Tippie Coll Business, 108 John Pappajohn Business Bldg, Iowa City, IA 52242, USA

The rise of mobile communication, ample computing power, and Amazon's training of customers have led to last-mile delivery challenges and created struggles for companies seeking to budget their limited delivery resources efficiently to generate enough revenue. In this paper, we examine the capacitated customer acceptance problem with stochastic requests (CAPSR), a problem in which a company seeks to maximize expected revenue by accepting or rejecting requests. Each accepted request generates revenue and must be routed, consuming driver time and vehicle capacity. To solve the problem, we introduce a novel method of value function approximation (VFA). Conventionally, VFAs are either parametric (P-VFAs) or non-parametric (N-VFAs). Both VFAs have advantages and shortcomings, and their performance relies significantly on the structure of the underlying problem. To combine the advantages and to alleviate the shortcomings of P-VFA and N-VFA used individually, we present a novel method, meso-parametric value function approximation (M-VFA). The results of computational experiments show that the M-VFA outperforms benchmarks for the CAPSR and that it offers the advantages of the individual VFAs while alleviating their shortcomings. Most importantly, we demonstrate that simultaneous approximations lead to better outcomes than either N-VFA or P-VFA individually or some ex-post combination. (C) 2019 Published by Elsevier B.V.

Keywords: dynamic customer acceptances; dynamic vehicle routing; dynamic multi-dimensional knapsack problem; approximate dynamic programming; Value function approximation

An Approximation Approach for Response-Adaptive Clinical Trial Design

INFORMS JOURNAL ON COMPUTING, 2020, vol. 32, no. 4, pp. 877-894

Authors: Ahuja, Vishal; Birge, John R.
Affiliations: Southern Methodist Univ, Cox Sch Business, Dallas, TX 75275, USA; Univ Chicago, Booth Sch Business, Chicago, IL 60637, USA

Multiarmed bandit (MAB) problems, typically modeled as Markov decision processes (MDPs), exemplify the learning versus earning trade-off. An area that has motivated theoretical research in MAB designs is the study of clinical trials, where the application of such designs has the potential to significantly improve patient outcomes. However, for many practical problems of interest, the state space is intractably large, rendering exact approaches to solving MDPs impractical. In particular, settings that require multiple simultaneous allocations lead to an expanded state and action-outcome space, necessitating the use of approximation approaches. We propose a novel approximation approach that combines the strengths of multiple methods: grid-based state discretization, value function approximation methods, and techniques for a computationally efficient implementation. The hallmark of our approach is the accurate approximation of the value function that combines linear interpolation with bounds on interpolated value and the addition of a learning component to the objective function. Computational analysis on relevant datasets shows that our approach outperforms existing heuristics (e.g., greedy and upper confidence bound family of algorithms) and a popular Lagrangian-based approximation method, where we find that the average regret improves by up to 58.3%. A retrospective implementation on a recently conducted phase 3 clinical trial shows that our design could have reduced the number of failures by 17% relative to the randomized control design used in that trial. Our proposed approach makes it practically feasible for trial administrators and regulators to implement Bayesian response-adaptive designs on large clinical trials with potential significant gains.

Keywords: adaptive clinical trials; Markov decision process; grid-based approximation; adaptive sampling; approximate dynamic programming
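
The abstract cites the upper confidence bound family as one of the benchmark heuristics. As a rough illustration of that family, and not of the paper's own approximation method, the sketch below runs the textbook UCB1 rule on two treatment arms with Bernoulli outcomes. The success probabilities, horizon, seed, and the function name `ucb1` are assumptions chosen for the example.

```python
import math
import random


def ucb1(success_probs, horizon, seed=0):
    """Simulate UCB1 on Bernoulli arms; return per-arm pull and success counts."""
    rng = random.Random(seed)
    k = len(success_probs)
    pulls = [0] * k
    successes = [0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: try every arm once
        else:
            # Pick the arm with the highest empirical mean plus exploration bonus.
            arm = max(
                range(k),
                key=lambda i: successes[i] / pulls[i]
                + math.sqrt(2 * math.log(t) / pulls[i]),
            )
        reward = 1 if rng.random() < success_probs[arm] else 0
        pulls[arm] += 1
        successes[arm] += reward
    return pulls, successes


# Hypothetical trial: arm 1 has the higher true success probability.
pulls, successes = ucb1([0.2, 0.8], horizon=2000)
```

The exploration bonus shrinks as an arm accumulates pulls, so allocations drift toward the arm with the better observed success rate, which is the learning-versus-earning trade-off the abstract describes.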

A Multi-Critic Reinforcement Learning Method: An Application to Multi-Tank Water Systems

IEEE ACCESS, 2020, vol. 8, pp. 173227-173238

Authors: Martinez-Piazuelo, Juan; Ochoa, Daniel E.; Quijano, Nicanor; Giraldo, Luis Felipe
Affiliations: Univ los Andes, Dept Ingn Elect & Elect, Bogota 111711, Colombia; Univ Colorado, Dept Elect Comp & Energy Engn, Boulder, CO 80309, USA

This paper investigates the combination of reinforcement learning and neural networks applied to the data-driven control of dynamical systems. In particular, we propose a multi-critic actor-critic architecture that eases the value function learning task by distributing it across multiple neural networks. We also propose a filtered multi-critic approach that offers further performance improvements as it eases the training process of the control policy. All the studied methods are evaluated with several numerical experiments on multi-tank water systems with nonlinear coupled dynamics, where control is known to be a challenging task. The simulation results show that the proposed multi-critic scheme is able to outperform the standard actor-critic approach in terms of speed and sensitivity of the learning process. Moreover, the results show that the filtered multi-critic strategy outperforms the unfiltered one under these same terms. This document highlights the benefits of the multi-critic methodology on a state-of-the-art reinforcement learning algorithm, the deep deterministic policy gradient, and demonstrates its application to multi-tank water systems relevant for industrial process control.

Keywords: Neural networks; Learning (artificial intelligence); Task analysis; Training; Control systems; Process control; Sensitivity; Data-driven control; approximate dynamic programming; reinforcement learning; actor-critic methods; deep deterministic policy gradient; water-tank systems

Robust optimal control for a class of nonlinear systems with unknown disturbances based on disturbance observer and policy iteration

NEUROCOMPUTING, 2020, vol. 390, pp. 185-195

Authors: Song, Ruizhuo; Lewis, Frank L.
Affiliations: Univ Sci & Technol Beijing, Sch Automat & Elect Engn, Beijing 100083, Peoples R China; Univ Texas Arlington, UTA Res Inst, Ft Worth, TX 76118, USA

This paper presents a robust optimal control method for a class of nonlinear systems with unknown disturbances. Within this framework, adaptive dynamic programming (ADP) is used to obtain the optimal control. On-policy learning allows the performance index function and the optimal control to be obtained iteratively. It is shown that the iterative performance index function is non-increasing. A nonlinear disturbance observer is designed to estimate external disturbances, and a compensation control is used to offset their influence. It is proven that the disturbance observer error is exponentially stable under some conditions. The properties of the nonlinear system with unknown disturbance steered by the robust optimal control input are also proven. Simulation results demonstrate the performance of the proposed robust optimal control scheme for the nonlinear system with unknown disturbance. (C) 2020 Elsevier B.V. All rights reserved.

Keywords: Adaptive critic designs; Adaptive dynamic programming; approximate dynamic programming; Optimal control; On-policy; Disturbance

Coordinating Pricing and Empty Container Repositioning in Two-Depot Shipping Systems

TRANSPORTATION SCIENCE, 2020, vol. 54, no. 6, pp. 1697-1713

Authors: Lu, Tao; Lee, Chung-Yee; Lee, Loo-Hay
Affiliations: Univ Connecticut, Sch Business, Storrs, CT 06269, USA; Hong Kong Univ Sci & Technol, Dept Ind Engn & Decis Analyt, Kowloon, Clear Water Bay, Hong Kong, Peoples R China; Natl Univ Singapore, Dept Ind Syst Engn & Management, Singapore 119077, Singapore

This paper studies joint decisions on pricing and empty container repositioning in two-depot shipping services with stochastic shipping demand. We formulate the problem as a stochastic dynamic programming model. The exact dynamic program may have a high-dimensional state space because of the in-transit containers. To cope with the curse of dimensionality, we develop an approximate model where the number of in-transit containers on each vessel is approximated with a fixed container flow predetermined by solving a static version of the problem. Moreover, we show that the approximate value function is L♮-concave, thereby characterizing the structure of the optimal control policy for the approximate model. With the upper bound obtained by solving the information relaxation-based dual of the exact dynamic program, we numerically show that the control policies generated from our approximate model are close to optimal when transit times span multiple periods.

Keywords: empty container repositioning; dynamic pricing; Markov decision process; L♮-concavity; approximate dynamic programming; duality

Least squares policy iteration with instrumental variables vs. direct policy search: comparison against optimal benchmarks using energy storage

INFOR, 2020, vol. 58, no. 1, pp. 141-166

Authors: Moazeni, Somayeh; Scott, Warren R.; Powell, Warren B.
Affiliations: Stevens Inst Technol, Sch Business, Hoboken, NJ 07030, USA; Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544, USA

This article studies least-squares approximate policy iteration (API) methods with parametrized value-function approximation. We study several variations of the policy evaluation phase, namely, Bellman error minimization, Bellman error minimization with instrumental variables, projected Bellman error minimization, and projected Bellman error minimization with instrumental variables. For a general discrete-time stochastic control problem, Bellman error minimization policy evaluation using instrumental variables is equivalent to both variants of the projected Bellman error minimization. An alternative to these API methods is direct policy search based on knowledge gradient. The practical performance of these three approximate dynamic programming methods, (i) least-squares API with Bellman error minimization, (ii) least-squares API with Bellman error minimization with instrumental variables, and (iii) direct policy search, is investigated in the context of an application in energy storage operations management. We create a library of test problems using real-world data and apply value iteration to find their optimal policies. These optimal benchmarks are then used to compare the developed approximate dynamic programming policies. Our analysis indicates that least-squares API with instrumental variables Bellman error minimization markedly outperforms least-squares API with Bellman error minimization. However, these approaches underperform our direct policy search implementation.

Keywords: dynamic programming; approximate dynamic programming; approximate policy iteration; Bellman error minimization; direct policy search; energy storage
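
The abstract says the optimal benchmarks were obtained by value iteration. As a generic reminder of that procedure, the sketch below runs discounted value iteration on a two-state toy problem. The toy transition table, the discount factor, and the function name `value_iteration` are assumptions made for illustration; the paper's energy storage test problems are not reproduced here.

```python
def value_iteration(P, gamma=0.9, tol=1e-10, max_iter=10_000):
    """P[s][a] is a list of (probability, next_state, reward) outcomes.
    Returns the converged value function and a greedy policy."""

    def q_value(V, s, a):
        # Expected one-step reward plus discounted value of the next state.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    V = {s: 0.0 for s in P}
    for _ in range(max_iter):
        new_V = {s: max(q_value(V, s, a) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    policy = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
    return V, policy


# Hypothetical toy problem: staying in state 1 earns a reward of 1 per step.
toy = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 1.0)], "move": [(1.0, 0, 0.0)]},
}
V, policy = value_iteration(toy)
```

In this toy case the values converge to roughly V(1) = 1/(1 - 0.9) = 10 and V(0) = 0.9 × 10 = 9, and the greedy policy moves from state 0 to state 1 and then stays there.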

Dynamic multi-priority, multi-class patient scheduling with stochastic service times

EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 2020, vol. 280, no. 1, pp. 254-265

Authors: Saure, Antoine; Begen, Mehmet A.; Patrick, Jonathan
Affiliations: Univ Ottawa, Telfer Sch Management, 55 Laurier Ave East, Ottawa, ON K1N 6N5, Canada; Western Univ, Ivey Sch Business, 1255 Western Rd, London, ON N6G 0N1, Canada

Efficient patient scheduling has significant operational, clinical, and economic benefits for health care systems, not only increasing patients' timely access to care but also reducing costs. However, patient scheduling is complex due to, among other aspects, the existence of multiple priority levels, the presence of multiple service requirements, and its stochastic nature. Patient appointment (allocation) scheduling refers to the assignment of specific appointment start times to a set of patients scheduled for a particular day, while advance patient scheduling refers to the assignment of future appointment days to patients. These two problems have generally been addressed separately despite each being highly dependent on the form of the other. This paper develops a framework that incorporates stochastic service times into the advance scheduling problem as a first step towards bridging these two problems. In this way, we take into account not only the waiting time until the day of service but also the idle time/overtime of medical resources on the day of service. We first extend the current literature by providing theoretical and numerical results for the case with multi-class, multi-priority patients and deterministic service times. We then adapt the model to incorporate stochastic service times and perform a comprehensive numerical analysis on a number of scenarios, including a practical application. Results suggest that the advance scheduling policies based on deterministic service times cannot be easily improved upon by incorporating stochastic service times, a finding that has important implications for practice and future research on the combined problem. (C) 2019 Elsevier B.V. All rights reserved.

Keywords: OR in health services; Patient scheduling; Markov decision processes; approximate dynamic programming; Linear programming
