Search results - Inner Mongolia University Library

A Low-Rank Approximation for MDPs via Moment Coupling

OPERATIONS RESEARCH, 2024, Vol. 72, No. 3, pp. 1255-1277

Authors: Zhang, Amy B. Z.; Gurvich, Itai. Affiliations: Cornell Univ Sch Operat Res & Informat Engn Ithaca NY 14853 USA; Northwestern Univ Kellogg Sch Management Evanston IL 60208 USA

We introduce a framework to approximate Markov decision processes (MDPs) that stands on two pillars: (i) state aggregation, as the algorithmic infrastructure, and (ii) central-limit-theorem-type approximations, as the mathematical underpinning of optimality guarantees. The theory is grounded in recent work by Braverman et al. (2020) that relates the solution of the Bellman equation to that of a partial differential equation (PDE) where, in the spirit of the central limit theorem, the transition matrix is reduced to its local first and second moments. Solving the PDE is not required by our method. Instead, we construct a "sister" (controlled) Markov chain whose two local transition moments are approximately identical with those of the focal chain. Because of this moment matching, the original chain and its sister are coupled through the PDE, a coupling that facilitates optimality guarantees. Embedded into standard soft aggregation algorithms, moment matching provides a disciplined mechanism to tune the aggregation and disaggregation probabilities. Computational gains arise from the reduction of the effective state space from N to N^{1/2+ε}, as one might intuitively expect from approximations grounded in the central limit theorem.

Keywords: Markov processes; approximate dynamic programming; state aggregation; parameter design; algorithm analysis
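As a rough illustration of the "local first and second moments" that the abstract above refers to, the following sketch computes the mean and second moment of the one-step increment for a small birth-death chain. The chain, its probabilities, and the function names are hypothetical and not taken from the paper; the sketch does not construct a sister chain and does not solve any PDE.

```python
def birth_death_row(x, n, p_up=0.3, p_down=0.2):
    """Transition probabilities out of state x for a toy chain on 0..n-1 (assumed)."""
    row = [0.0] * n
    up = p_up if x < n - 1 else 0.0      # no upward move from the top state
    down = p_down if x > 0 else 0.0      # no downward move from the bottom state
    if x < n - 1:
        row[x + 1] = up
    if x > 0:
        row[x - 1] = down
    row[x] = 1.0 - up - down             # remaining mass stays put
    return row


def local_moments(row, x):
    """Return (first, second) moments of the increment y - x under one row."""
    first = sum(p * (y - x) for y, p in enumerate(row))
    second = sum(p * (y - x) ** 2 for y, p in enumerate(row))
    return first, second


n = 10
moments = [local_moments(birth_death_row(x, n), x) for x in range(n)]
```

In this toy chain every interior state has the same pair of local moments, which is the kind of summary a coarser chain would have to reproduce approximately under moment matching.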

A tutorial on value function approximation for stochastic and dynamic transportation

4OR-A QUARTERLY JOURNAL OF OPERATIONS RESEARCH, 2024, Vol. 22, No. 1, pp. 145-173

Authors: Heinold, Arne. Affiliations: Univ Kiel Sch Econ & Business Kiel Germany

This paper provides an introductory tutorial on Value Function Approximation (VFA), a solution class from approximate dynamic programming. VFA describes a heuristic way for solving sequential decision processes like a Markov Decision Process. Real-world problems in supply chain management (and beyond) containing dynamic and stochastic elements might be modeled as such processes, but large-scale instances are intractable to be solved to optimality by enumeration due to the curses of dimensionality. VFA can be a proper method for these cases and this tutorial is designed to ease its use in research, practice, and education. For this, the tutorial describes VFA in the context of stochastic and dynamic transportation and makes three main contributions. First, it gives a concise theoretical overview of VFA's fundamental concepts, outlines a generic VFA algorithm, and briefly discusses advanced topics of VFA. Second, the VFA algorithm is applied to the taxicab problem that describes an easy-to-understand transportation planning task. Detailed step-by-step results are presented for a small-scale instance, allowing readers to gain an intuition about VFA's main principles. Third, larger instances are solved by enhancing the basic VFA algorithm demonstrating its general capability to approach more complex problems. The experiments are done with artificial instances and the respective Python scripts are part of an electronic appendix. Overall, the tutorial provides the necessary knowledge to apply VFA to a wide range of stochastic and dynamic settings and addresses likewise researchers, lecturers, tutors, students, and practitioners.

Keywords: Tutorial; Markov decision process; approximate dynamic programming; Value function approximation; Reinforcement learning
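To give a feel for the kind of generic VFA loop the abstract above mentions, here is a minimal sketch with a lookup-table approximation and a smoothing update. It is an assumption-laden toy: the chain of states, the reward, the discount factor, and the step size are all hypothetical, and it is neither the tutorial's taxicab problem nor its accompanying Python scripts.

```python
GAMMA = 0.9          # discount factor (assumed)
ALPHA = 0.1          # smoothing step size (assumed)
GOAL = 4             # terminal state of the toy chain (assumed)
ACTIONS = (+1, -1)   # move right or left; ties resolve to the first entry


def step(state, action):
    """Deterministic toy dynamics on 0..GOAL with reward 1 on reaching GOAL."""
    nxt = min(max(state + action, 0), GOAL)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward


def train(episodes=300, max_steps=50):
    values = [0.0] * (GOAL + 1)   # lookup-table VFA; terminal value stays 0
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            if state == GOAL:
                break
            # Score each action with the current approximation.
            scored = []
            for action in ACTIONS:
                nxt, reward = step(state, action)
                scored.append((reward + GAMMA * values[nxt], nxt))
            best_value, best_next = max(scored, key=lambda pair: pair[0])
            # Smooth the new estimate into the table, then follow the greedy action.
            values[state] = (1 - ALPHA) * values[state] + ALPHA * best_value
            state = best_next
    return values


values = train()
```

The table entries approach the discounted value of walking straight to the goal; richer approximations would replace the table with features or a parametric model.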

Online accelerated data-driven learning for optimal feedback control of discrete-time partially uncertain systems

INTERNATIONAL JOURNAL OF ADAPTIVE CONTROL AND SIGNAL PROCESSING, 2024, Vol. 38, No. 3, pp. 848-876

Authors: Somers, Luke; Haddad, Wassim M.; Kokolakis, Nick-Marios T.; Vamvoudakis, Kyriakos G. Affiliations: Georgia Inst Technol Sch Aerosp Engn Atlanta GA USA; Georgia Inst Technol Sch Aerosp Engn Atlanta GA 30332 USA

In this paper, we develop an online learning algorithm for solving the Bellman equation for affine in the control discrete-time nonlinear uncertain dynamical systems. To ensure accelerated learning of our algorithm in generating optimal control policies, we use an actor-critic structure predicated on higher-order tuner laws. More specifically, we construct a Nesterov-like architecture involving momentum-based learning laws leading to an accelerated convergence of the optimal control policy. The proposed online learning-based optimal control framework guarantees uniform ultimate boundedness of the closed-loop system under the assumption that the system is persistently excited. Finally, two illustrative numerical examples are provided to demonstrate the efficacy of the proposed approach.

Keywords: approximate dynamic programming; discrete-time systems; high-order tuners; momentum-based learning; online learning; optimal control; uniform ultimate boundedness
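The "Nesterov-like" momentum idea named in the abstract above can be illustrated with a textbook Nesterov accelerated gradient step on a toy quadratic loss. This is only a generic sketch: the loss, learning rate, and momentum value are assumptions, and the code is not the paper's higher-order tuner laws or its actor-critic structure.

```python
def grad(w):
    """Gradient of the toy loss 0.5 * (10 * w[0]**2 + w[1]**2) (assumed)."""
    return [10.0 * w[0], 1.0 * w[1]]


def nesterov(w0, lr=0.05, momentum=0.9, steps=200):
    """Textbook Nesterov accelerated gradient descent with a look-ahead gradient."""
    w = list(w0)
    v = [0.0] * len(w)
    for _ in range(steps):
        # Evaluate the gradient at the look-ahead point rather than at w itself.
        lookahead = [wi + momentum * vi for wi, vi in zip(w, v)]
        g = grad(lookahead)
        v = [momentum * vi - lr * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w


w_final = nesterov([1.0, 1.0])
```

The look-ahead evaluation is what distinguishes this update from plain momentum; in an actor-critic setting the same idea would be applied to the weight-update laws.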

A stabilizing reinforcement learning approach for sampled systems with partially unknown models

INTERNATIONAL JOURNAL OF ROBUST AND NONLINEAR CONTROL, 2024, Vol. 34, No. 18, pp. 12389-12412

Authors: Beckenbach, Lukas; Osinenko, Pavel; Streif, Stefan. Affiliations: Tech Univ Chemnitz Automatic Control & Dynam Syst Lab Chemnitz Germany; Skolkovo Inst Sci & Technol Digital Engn Ctr Moscow Russia

Reinforcement learning is commonly associated with training of reward-maximizing (or cost-minimizing) agents, in other words, controllers. It can be applied in model-free or model-based fashion, using a priori or online collected system data to train involved parametric architectures. In general, online reinforcement learning does not guarantee closed loop stability unless special measures are taken, for instance, through learning constraints or tailored training rules. Particularly promising are hybrids of reinforcement learning with classical control approaches. In this work, we suggest a method to guarantee practical stability of the system-controller closed loop in a purely online learning setting, in other words, without offline training. Moreover, we assume only partial knowledge of the system model. To achieve the claimed results, we employ techniques of classical adaptive control. The implementation of the overall control scheme is provided explicitly in a digital, sampled setting. That is, the controller receives the state of the system and computes the control action at discrete, specifically, equidistant moments in time. The method is tested in adaptive traction control and cruise control where it proved to significantly reduce the cost.

Keywords: adaptive control; approximate dynamic programming; optimal control; reinforcement learning

Off-Policy Model-Free Learning for Multi-Player Non-Zero-Sum Games With Constrained Inputs

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS, 2023, Vol. 70, No. 2, pp. 910-920

Authors: Huo, Yu; Wang, Ding; Qiao, Junfei; Li, Menghua. Affiliations: Beijing Univ Technol Beijing Inst Artificial Intelligence Fac Informat Technol Beijing Key Lab Computat Intelligence & Intelligen Beijing 100124 Peoples R China; Beijing Univ Technol Beijing Inst Artificial Intelligence Fac Informat Technol Beijing Lab Smart Environm Protect Beijing 100124 Peoples R China

In this paper, multi-player non-zero-sum games with control constraints are studied by utilizing a novel model-free approach based on the adaptive dynamic programming framework. First, the model-based policy iteration (PI) method is provided, which requires the system dynamics, and the convergence is demonstrated. Then, aiming to eliminate the need for the system dynamics, a model-free iterative method is obtained by using the off-policy integral reinforcement learning (IRL) scheme based on the PI approach. Moreover, the system data is collected in order to construct the model-free approach. Besides, we analyze the convergence of the off-policy IRL approach by proving the equivalence between the model-free iterative approach and the model-based iterative approach. Remarkably, in the implementation of the scheme, the control policy and cost function are approximated by utilizing the actor-critic networks. The least-squares algorithm is used to learn the actor-critic network weights from the collected data sets. Finally, two cases are provided to demonstrate the effectiveness of the established framework.

Keywords: Adaptive dynamic programming; approximate dynamic programming; continuous-time nonlinear systems; input constraints; integral reinforcement learning; non-zero-sum games; off-policy

Fourier-Hermite Dynamic Programming for Optimal Control

IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2023, Vol. 68, No. 10, pp. 6377-6384

Authors: Hassan, Syeda Sakira; Sarkka, Simo. Affiliations: Aalto Univ Dept Elect Engn & Automat Espoo 02150 Finland

In this article, we propose a novel computational method for solving nonlinear optimal control problems. The method is based on the use of Fourier-Hermite series for approximating the action-value function arising in dynamic programming instead of the conventional Taylor-series expansion used in differential dynamic programming. The coefficients of the Fourier-Hermite series can be numerically computed by using sigma-point methods, which leads to a novel class of sigma-point-based dynamic programming methods. We also prove the quadratic convergence of the method and experimentally test its performance against other methods.

Keywords: approximate dynamic programming; differential dynamic programming; Fourier-Hermite series; sigma-point dynamic programming; trajectory optimization
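The abstract above says that Fourier-Hermite coefficients can be computed numerically with sigma-point methods. A minimal one-dimensional sketch of that idea follows, assuming a standard normal variable, the three-point sigma rule (points 0 and ±√3), and a scalar test function; the function names are hypothetical and this is not the paper's algorithm for action-value functions.

```python
import math

# Three sigma points and weights for X ~ N(0, 1); this matches a 1-D unscented
# transform with kappa = 2 and integrates polynomials up to degree 5 exactly.
POINTS = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)


def hermite(k, x):
    """Probabilists' Hermite polynomial He_k(x) via He_{n+1} = x He_n - n He_{n-1}."""
    prev, cur = 1.0, x
    if k == 0:
        return prev
    for n in range(1, k):
        prev, cur = cur, x * cur - n * prev
    return cur


def fourier_hermite_coeffs(f, order=2):
    """Approximate c_k = E[f(X) He_k(X)] / k! with the sigma-point rule above."""
    return [
        sum(w * f(x) * hermite(k, x) for x, w in zip(POINTS, WEIGHTS))
        / math.factorial(k)
        for k in range(order + 1)
    ]


coeffs = fourier_hermite_coeffs(lambda x: x * x)
```

For the test function x², the expansion is He_0(x) + He_2(x), so the computed coefficients should be close to 1, 0, and 1.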

Combined Use of Dynamic Inversion and Reinforcement Learning for Motion Control of an Supersonic Transport Aircraft

OPTICAL MEMORY AND NEURAL NETWORKS, 2024, Vol. 33, Suppl. 3, pp. S399-S413

Authors: Dhiman, Gaurav; Tiumentsev, Yu. V.; Tskhai, R. A. Affiliations: Natl Res Univ Moscow Aviat Inst Moscow 125080 Russia

The task of aircraft motion control has to be solved under numerous heterogeneous uncertainties, both in the aircraft motion model and in the environment in which the aircraft is flying. These uncertainties arise, in particular, because various kinds of abnormal situations can occur in flight, caused by failures of aircraft equipment and systems or by damage to the airframe and propulsion system. Some of these failures and damages have a direct impact on the dynamic characteristics of the aircraft as a control object. The problem therefore arises of adjusting the aircraft control algorithms so that they can adapt to the changed dynamics of the aircraft. It is extremely difficult, and in some cases impossible, to foresee in advance all possible damages, failures and their combinations. Hence, it is necessary to implement adaptive flight control algorithms that are able to adjust to the changing situation. One of the effective tools for solving such problems is reinforcement learning in the approximate dynamic programming (ADP) variant, in combination with artificial neural networks. In the last decade, a family of methods known as Adaptive Critic Design (ACD) has been actively developed within the ADP approach to control the behavior of complex dynamic systems. In our paper we consider the application of one of the variants of the ACD approach, namely SNAC (Single Network Adaptive Critic), and its development through joint use with the method of dynamic inversion. The effectiveness of this approach is demonstrated on the example of longitudinal motion control of a supersonic transport airplane.

Keywords: aircraft motion control; machine learning; dynamic inversion; approximate dynamic programming; adaptive critic design; SNAC approach; adaptive control

Self-Guided Approximate Linear Programs: Randomized Multi-Shot Approximation of Discounted Cost Markov Decision Processes

MANAGEMENT SCIENCE, 2025, Vol. 71, No. 4, pp. iv-vi, 2751-3636

Authors: Pakiman, Parshan; Nadarajah, Selvaprabu; Soheili, Negar; Lin, Qihang. Affiliations: Univ Illinois Coll Business Adm Chicago IL 60607 USA; Univ Iowa Tippie Coll Business Iowa City IA 52242 USA

Approximate linear programs (ALPs) are well-known models based on value function approximations (VFAs) to obtain policies and lower bounds on the optimal policy cost of discounted-cost Markov decision processes (MDPs). Formulating an ALP requires (i) basis functions, the linear combination of which defines the VFA, and (ii) a state-relevance distribution, which determines the relative importance of different states in the ALP objective for the purpose of minimizing VFA error. Both of these choices are typically heuristic; basis function selection relies on domain knowledge, whereas the state-relevance distribution is specified using the frequency of states visited by a baseline policy. We propose a self-guided sequence of ALPs that embeds random basis functions obtained via inexpensive sampling and uses the known VFA from the previous iteration to guide VFA computation in the current iteration. In other words, this sequence takes multiple shots randomly approximating the MDP value function with VFA-based guidance between consecutive approximation attempts. Self-guided ALPs mitigate domain knowledge during basis function selection and the impact of the state-relevance-distribution choice, thus reducing the ALP implementation burden. We establish high-probability error bounds on the VFAs from this sequence and show that a worst-case measure of policy performance is improved. We find that these favorable implementation and theoretical properties translate to encouraging numerical results on perishable inventory control and options pricing applications, where self-guided ALP policies improve upon policies from problem-specific methods. More broadly, our research takes a meaningful step toward application-agnostic policies and bounds for MDPs.

Keywords: approximate linear programming; random features; Markov decision processes; approximate dynamic programming; reinforcement learning; inventory control; options pricing

A Bayesian learning and pricing model with multiple unknown demand parameters

ANNALS OF OPERATIONS RESEARCH, 2024, Vol. 343, No. 1, pp. 493-513

Authors: Xiao, Baichun; Yang, Wei. Affiliations: Long Isl Univ Coll Management CW Post Brookville NY 11548 USA

This article presents a Bayesian learning model for demand estimation in revenue management. Different from most existing models in the literature, our discussion centers on demand functions with an arbitrary number of unknown and correlated parameters, and on estimating them simultaneously. We formulate the problem as a Dirichlet learning model and show the search process converges to the true parameter values. As the observed data does not unambiguously reveal the underlying demand curve, the exploration scheme is notably different from the conventional Dirichlet sampling process. We apply a partially observable Markov decision process to ensure the true demand curve surfaces as a favorite. Our pricing policy during the learning phase also differs from myopic heuristics by taking both the remaining time and unsold items into consideration. As incomplete learning remains a concern for all existing learning models, we show that the occurrence of uninformative prices is rooted in the dynamics of pricing, and prove that the proposed model is immune from incomplete learning. For revenue performance, the regret bounds established are comparable to the benchmark in the literature under similar conditions. Overall, the proposed model integrates the learning process with earning goals and offers a promising tool to achieve both targets.

Keywords: Bayesian demand estimation; Pricing; approximate dynamic programming; Incomplete learning

Balancing resources for dynamic vehicle routing with stochastic customer requests

OR SPECTRUM, 2024, Vol. 46, No. 2, pp. 331-373

Authors: Soeffker, Ninja; Ulmer, Marlin W.; Mattfeld, Dirk C. Affiliations: Univ Vienna Dept Business Decis & Analyt Vienna Austria; Otto von Guericke Univ Chair Management Sci Magdeburg Germany; Tech Univ Carolo Wilhelmina Braunschweig Decis Support Grp Braunschweig Germany

We consider a service provider performing pre-planned service for initially known customers with a fleet of vehicles, e.g., parcel delivery. During execution, new dynamic service requests occur, e.g., for parcel pickup. The goal of the service provider is to serve as many dynamic requests as possible while ensuring service of all initial customers. The allocation of initial services impacts the potential of serving dynamic requests. An allocation aiming at a time-efficient initial routing leads to minimal overall workload regarding the initial solution but may congest some vehicles that are unable to serve additional requests along their routes. An even workload division is less efficient but grants all vehicles flexibility for additional services. In this paper, we investigate the balance between efficiency and flexibility. For the initial customers, we modify a routing algorithm to allow a shift between efficient initial routing and evenly balanced workloads. For effective dynamic decision making with respect to the dynamic requests, we present value function approximations with different feature sets capturing vehicle workload in different levels of detail. We show that sacrificing some initial routing efficiency in favor of a balanced vehicle workload is a key factor for a flexible integration of later customer requests that leads to an average improvement of 10.75%. Further, we show when explicitly depicting heterogeneity in the vehicle workload through features of the value function approximation provides benefits, and that the best choice of features leads to an average improvement of 5.71% compared to the worst feature choice.

Keywords: dynamic vehicle routing; Same-day service; approximate dynamic programming; Value function approximation
