Search results - Inner Mongolia University Library

Approximate dynamic programming for pickup and delivery problem with crowd-shipping

TRANSPORTATION RESEARCH PART B-METHODOLOGICAL, 2024, vol. 187

Authors: Mousavi, Kianoush; Bodur, Merve; Cevik, Mucahit; Roorda, Matthew J. Affiliations: Univ Toronto, Dept Civil & Mineral Engn, Toronto, ON, Canada; Univ Edinburgh, Sch Math, Edinburgh, Scotland; Toronto Metropolitan Univ, Dept Mech Ind & Mechatron Engn, Toronto, ON, Canada

We study a variant of the dynamic pickup and delivery crowd-shipping operation for delivering online orders within a few hours from a brick-and-mortar store. This crowd-shipping operation is subject to a high degree of uncertainty due to the stochastic arrival of online orders and crowd-shippers, which poses several challenges for efficient matching of orders to crowd-shippers. We formulate the problem as a Markov decision process and develop an Approximate Dynamic Programming (ADP) policy using value function approximation for obtaining a highly scalable and real-time matching strategy while considering temporal and spatial uncertainty in arrivals of online orders and crowd-shippers. We incorporate several algorithmic enhancements to the ADP algorithm, which significantly improve the convergence. We compare the ADP policy with an optimization-based myopic policy using various performance measures. Our numerical analysis with varying parameter settings shows that ADP policies can lead to up to 25.2% cost savings and a 9.8% increase in the number of served orders. Overall, we find that our proposed framework can guide crowd-shipping platforms toward efficient real-time matching decisions and enhance the platform delivery capacity.
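
The contrast between the myopic policy and the ADP policy in this abstract can be illustrated with a generic decision rule. This is only a sketch, not necessarily the authors' formulation: the symbols S_t (state), \mathcal{X}(S_t) (feasible matching decisions), C (immediate cost) and \bar{V}_{t+1} (approximate value function) are all assumed here for illustration.

```latex
% Generic sketch with assumed notation; the record does not give the paper's own formulation.
% The myopic rule minimizes immediate cost only; the ADP rule adds an
% approximate estimate of the expected future cost.
\begin{align}
  x_t^{\text{myopic}} &= \arg\min_{x \in \mathcal{X}(S_t)} C(S_t, x), \\
  x_t^{\text{ADP}} &= \arg\min_{x \in \mathcal{X}(S_t)}
    \Big\{ C(S_t, x) + \mathbb{E}\big[\bar{V}_{t+1}(S_{t+1}) \mid S_t, x\big] \Big\}.
\end{align}
```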

Keywords: Crowd-shipping; Last-mile delivery; Markov decision process; Approximate dynamic programming; value function approximation

Reinforcement learning with automatic basis construction based on isometric feature mapping

INFORMATION SCIENCES, 2014, vol. 286, pp. 209-227

Authors: Huang, Zhenhua; Xu, Xin; Zuo, Lei. Affiliation: Natl Univ Def Technol, Coll Mechatron & Automat, Changsha 410073, Hunan, Peoples R China

Value function approximation (VFA) has been a major research topic in reinforcement learning. Although various reinforcement learning algorithms with VFA have been proposed, the performance of most previous algorithms depends on the predefined structure of the basis functions. To address this problem, this paper presents a novel basis learning method for VFA based on isometric feature mapping (IFM). In the proposed method, basis functions for VFA are automatically generated by constructing the optimal embedding basis of the data in a d-dimensional Euclidean space, which best preserves the estimated intrinsic geometry of the manifold. Furthermore, the IFM-based basis learning method is integrated with approximate policy iteration (API) for learning control in Markov decision problems with large state spaces. A new manifold reinforcement learning framework termed IFM-based API (IFM-API) is presented. Three learning control problems, including a real control system of the Googol single inverted pendulum, were studied to evaluate the performance of the proposed IFM-API algorithm. The simulation and experimental results show that, compared with other basis selection or learning methods, the IFM-based basis learning method can automatically compute an efficient set of basis functions with far fewer predefined parameters and lower computational cost. In addition, the proposed IFM-API algorithm is shown to obtain better learning control policies than other API methods. (C) 2014 Elsevier Inc. All rights reserved.

Keywords: Reinforcement learning; Isometric feature mapping; value function approximation; Approximate policy iteration; Learning control

Actor-Critic Learning Control Based on l₂-Regularized Temporal-Difference Prediction With Gradient Correction

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2018, vol. 29, no. 12, pp. 5899-5909

Authors: Li, Luntong; Li, Dazi; Song, Tianheng; Xu, Xin. Affiliations: Beijing Univ Chem Technol, Dept Automat, Beijing 100029, Peoples R China; Natl Univ Def Technol, Coll Mechatron & Automat, Inst Unmanned Syst, Changsha 410073, Hunan, Peoples R China

Actor-critic methods based on the policy gradient (PG-based AC) have been widely studied to solve learning control problems. In order to increase the data efficiency of learning prediction in the critic of PG-based AC, studies on how to use recursive least-squares temporal difference (RLS-TD) algorithms for policy evaluation have been conducted in recent years. In such contexts, the RLS-TD critic evaluates an unknown mixed policy generated by a series of different actors, rather than one fixed policy generated by the current actor. Therefore, this AC framework with an RLS-TD critic cannot be proved to converge to the optimal fixed point of the learning problem. To address this problem, this paper proposes a new AC framework named critic-iteration PG (CIPG), which learns the state-value function of the current policy in an on-policy way and performs gradient ascent in the direction of improving the discounted total reward. During each iteration, CIPG keeps the policy parameters fixed and evaluates the resulting fixed policy with an l₂-regularized RLS-TD critic. Our convergence analysis extends previous convergence analysis of PG with function approximation to the case of an RLS-TD critic. The simulation results demonstrate that the l₂-regularization term in the critic of CIPG is undamped during the learning process, and that CIPG has better learning efficiency and a faster convergence rate than conventional AC learning control methods.

Keywords: l₂-regularization; actor-critic (AC); policy gradient (PG); reinforcement learning (RL); value function approximation

Recursive Least-Squares Temporal Difference With Gradient Correction

IEEE TRANSACTIONS ON CYBERNETICS, 2021, vol. 51, no. 8, pp. 4251-4264

Authors: Song, Tianheng; Li, Dazi; Yang, Weimin; Hirasawa, Kotaro. Affiliations: Beijing Univ Chem Technol, Coll Informat Sci & Technol, Dept Automat, Beijing 100029, Peoples R China; Beijing Univ Chem Technol, Coll Mech & Elect Engn, Dept Mech Engn, Beijing 100029, Peoples R China

Since the late 1980s, temporal difference (TD) learning has dominated the research area of policy evaluation algorithms. However, the need to avoid TD defects, such as low data efficiency and divergence in off-policy learning, has inspired a large number of novel TD-based approaches. Gradient-based and least-squares-based algorithms comprise the major part of these new approaches. This paper aims to combine the advantages of these two categories to derive an efficient policy evaluation algorithm with O(n²) per-time-step runtime complexity. The least-squares-based framework is adopted, and the gradient correction is used to improve convergence performance. This paper begins with the revision of a previous O(n³) batch algorithm, least-squares TD with a gradient correction (LS-TDC), to regularize the parameter vector. Based on the recursive least-squares technique, an O(n²) counterpart of LS-TDC called RC is proposed. To increase data efficiency, we generalize RC with eligibility traces. An off-policy extension is also proposed based on importance sampling. In addition, the convergence analysis for RC as well as LS-TDC is given. The empirical results in both on-policy and off-policy benchmarks show that RC has a higher estimation accuracy than RLS-TD and a significantly lower runtime complexity than LS-TDC.
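
To make the O(n²) per-step cost concrete, here is a minimal sketch of standard RLS-TD(0), the baseline this abstract compares against. It is not the paper's RC algorithm and has no gradient correction. The function names, the toy two-state chain and the initial scale of the inverse matrix are all assumptions made for illustration.

```python
# Hypothetical sketch of standard RLS-TD(0); not the RC algorithm of the paper.

def rls_td_step(theta, P, phi, phi_next, reward, gamma):
    """One RLS-TD(0) update; every operation is at most O(n^2) for n features."""
    n = len(theta)
    # d = phi - gamma * phi_next is the feature difference used by TD.
    d = [phi[i] - gamma * phi_next[i] for i in range(n)]
    # P_phi = P @ phi (column vector) and d_P = d^T @ P (row vector).
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    d_P = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
    denom = 1.0 + sum(d[i] * P_phi[i] for i in range(n))
    # TD error under the current weights: r + gamma * V(s') - V(s).
    td_error = reward - sum(d[i] * theta[i] for i in range(n))
    new_theta = [theta[i] + P_phi[i] * td_error / denom for i in range(n)]
    # Sherman-Morrison rank-one update of the inverse matrix.
    new_P = [[P[i][j] - P_phi[i] * d_P[j] / denom for j in range(n)]
             for i in range(n)]
    return new_theta, new_P


def evaluate_two_state_chain(sweeps=50, gamma=0.5, init_scale=1e4):
    """Evaluate a toy deterministic chain 0 -> 1 -> 0 with one-hot features."""
    features = {0: [1.0, 0.0], 1: [0.0, 1.0]}
    rewards = {0: 1.0, 1: 0.0}  # reward received when leaving each state
    theta = [0.0, 0.0]
    # P starts as a large multiple of the identity (weak regularization).
    P = [[init_scale, 0.0], [0.0, init_scale]]
    state = 0
    for _ in range(2 * sweeps):
        next_state = 1 - state
        theta, P = rls_td_step(theta, P, features[state],
                               features[next_state], rewards[state], gamma)
        state = next_state
    return theta


if __name__ == "__main__":
    print(evaluate_two_state_chain())
```

For this toy chain the exact values follow from V(0) = 1 + 0.5 V(1) and V(1) = 0.5 V(0), so the estimates should approach 4/3 and 2/3, up to the small bias introduced by the initial regularization.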

Keywords: Policy evaluation; reinforcement learning (RL); temporal differences (TDs); value function approximation

Operational planning and optimal sizing of microgrid considering multi-scale wind uncertainty

APPLIED ENERGY, 2017, vol. 195 (Jun. 1), pp. 616-633

Authors: Shin, Joohyun; Lee, Jay H.; Realff, Matthew J. Affiliations: Korea Adv Inst Sci & Technol, Chem & Biomol Engn Dept, Daejeon, South Korea; Georgia Inst Technol, Chem & Biomol Engn Dept, Atlanta, GA 30332, USA

Distributed and on-site energy generation and distribution systems employing renewable energy sources and energy storage devices (referred to as microgrids) have been proposed as a new design approach to meet our energy needs more reliably and with lower carbon footprint. Management of such a system is a multi-scale decision-making problem encompassing hourly dispatch, daily unit commitment (UC), and yearly sizing for which efficient formulations and solution algorithms are lacking thus far. Its dynamic nature and high uncertainty are additional factors in limiting efficient and reliable operation. In this study, two-stage stochastic programming (2SSP) for day-ahead UC and dispatch decisions is combined with a Markov decision process (MDP) evolving at a daily timescale. The one-day operation model is integrated with the MDP by using the value of a state of commitment and battery at the end of a day to ensure longer term implications of the decisions within the day are considered. In the MDP formulation, capturing daily evolving exogenous information, the value function is recursively approximated with sampled observations estimated from the daily 2SSP model. With this value function capturing all future operating costs, optimal sizing of the wind farm and battery devices is determined based on a surrogate function optimization. Meanwhile, a multi-scale wind model consistent from seasonal to hourly is developed for the connection of the decision hierarchy across the scales. The results of the proposed integrated approach are compared to those of the daily independent 2SSP model through a case study and real wind data. (C) 2017 Elsevier Ltd. All rights reserved.

Keywords: Microgrid operation and design; Multi-scale decision making; Wind uncertainty; Stochastic optimization; value function approximation

Generalized attention-weighted reinforcement learning

NEURAL NETWORKS, 2022, vol. 145, pp. 10-21

Authors: Bramlage, Lennart; Cortese, Aurelio. Affiliations: Bielefeld Univ, Fac Technol, D-33615 Bielefeld, Germany; ATR Inst Int, Computat Neurosci Labs, Seika 6190288, Japan

In neuroscience, attention has been shown to bidirectionally interact with reinforcement learning (RL) to reduce the dimensionality of task representations, restricting computations to relevant features. In machine learning, despite their popularity, attention mechanisms have seldom been administered to decision-making problems. Here, we leverage a theoretical model from computational neuroscience - the attention-weighted RL (AWRL), defining how humans identify task-relevant features (i.e., that allow value predictions) - to design an applied deep RL paradigm. We formally demonstrate that the conjunction of the self-attention mechanism, widely employed in machine learning, with value function approximation is a general formulation of the AWRL model. To evaluate our agent, we train it on three Atari tasks at different complexity levels, incorporating both task-relevant and irrelevant features. Because the model uses semantic observations, we can uncover not only which features the agent elects to base decisions on, but also how it chooses to compile more complex, relational features from simpler ones. We first show that performance depends in large part on the ability to compile new compound features, rather than mere focus on individual features. In line with neuroscience predictions, self-attention leads to high resiliency to noise (irrelevant features) compared to other benchmark models. Finally, we highlight the importance and separate contributions of both bottom-up and top-down attention in the learning process. Together, these results demonstrate the broader validity of the AWRL framework in complex task scenarios, and illustrate the benefits of a deeper integration between neuroscience-derived models and RL for decision making in machine learning. (C) 2021 The Author(s). Published by Elsevier Ltd.

Keywords: Self-attention; Decision-making; value function approximation; Deep reinforcement learning; Representation learning; Feature binding

Dynamic pricing for managed lanes with multiple entrances and exits

TRANSPORTATION RESEARCH PART C-EMERGING TECHNOLOGIES, 2018, vol. 96 (Nov.), pp. 304-320

Authors: Pandey, Venktesh; Boyles, Stephen D. Affiliation: Univ Texas Austin, Dept Civil Architectural & Environm Engn, Austin, TX 78712, USA

Priced managed lanes are increasingly being used to better utilize the existing capacity of the roadway to relieve congestion and offer reliable travel time to road users. In this paper, we investigate the optimization problem for pricing managed lanes with multiple entrances and exits, which seeks to maximize the revenue and minimize the total system travel time (TSTT) over a finite horizon. We propose a lane choice model where travelers make online decisions at each diverge point considering all routes on a managed lane network. We formulate the problem as a deterministic Markov decision process and solve it using the value function approximation (VFA) method for different initializations. We compare the performance of the toll policies predicted by the VFA method against the myopic revenue policy, which maximizes the revenue only at the current timestep, and two heuristic policies based on the measured densities on the managed and general purpose lanes (GPLs). We test the results on four different test networks. The primary findings from our research suggest the usefulness of the VFA method for determining dynamic tolls. The best-found objective value from the method at its termination is better than other heuristics for all test networks, with average improvements in the objective ranging between 10% and 90% for revenue maximization and 0-27% for TSTT minimization. Certain VFA initializations obtain best-found toll profiles within the first 5-50 iterations, which allows computational time savings. Our findings also indicate that the revenue-maximizing optimal policies follow the "jam-and-harvest" behavior where the GPLs are pushed towards congestion in the earlier time steps to generate higher revenue in the later time steps, a characteristic not observed for the policies minimizing TSTT.

Keywords: Managed lanes; Dynamic pricing; Route choices; Approximate dynamic programming; value function approximation

An Adaptive Policy Evaluation Network Based on Recursive Least Squares Temporal Difference With Gradient Correction

IEEE ACCESS, 2018, vol. 6, pp. 7515-7525

Authors: Li, Dazi; Wang, Yuting; Song, Tianheng; Jin, Qibing. Affiliation: Beijing Univ Chem Technol, Coll Informat Sci & Technol, Beijing 100029, Peoples R China

Reinforcement learning (RL) is an important machine learning paradigm that can be used for learning from the data obtained by the human-computer interface and the interaction in human-centered smart systems. One of the essential problems in RL algorithms is the estimation of value functions. Value functions are usually estimated via linearly parameterized value functions. Prior RL algorithms that generalize in this way spent their learning time tuning the linear weights while leaving the basis functions untouched. In fact, basis functions in value function approximation also have a significant influence on the performance. In this paper, a new adaptive policy evaluation network based on recursive least squares temporal difference (TD) with gradient correction (adaptive RC network) is proposed. Basis functions in the proposed algorithm are adaptively optimized, mainly with respect to their widths. In the proposed algorithm, the TD error and value function are estimated by the RC algorithm and value function approximation. The gradient derived from the square of the TD error is used to update the widths of the basis functions. Therefore, the RC network can adjust its network parameters in an adaptive way with a self-organizing approach according to the progress in learning. Empirical results based on three RL benchmarks show the performance and applicability of the proposed adaptive RC network.
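
The width update described in this abstract can be sketched as follows. The sketch rests on several assumptions that the record does not confirm: Gaussian radial basis functions over a scalar state, a semi-gradient of half the squared TD error (the bootstrapped target is held fixed), and hypothetical function names. It is not the authors' network.

```python
import math

# Hypothetical sketch: Gaussian RBF widths adapted by a semi-gradient of
# 0.5 * td_error**2. Basis form and gradient choice are assumptions.

def rbf_features(s, centers, widths):
    """Gaussian basis functions exp(-(s - c)^2 / (2 * w^2)) for scalar s."""
    return [math.exp(-((s - c) ** 2) / (2.0 * w ** 2))
            for c, w in zip(centers, widths)]


def value(s, weights, centers, widths):
    """Linear value estimate: weighted sum of the basis functions."""
    feats = rbf_features(s, centers, widths)
    return sum(wt * f for wt, f in zip(weights, feats))


def width_gradient(s, s_next, reward, gamma, weights, centers, widths):
    """Return the TD error and d(0.5 * td_error^2)/d(width_i), target fixed."""
    target = reward + gamma * value(s_next, weights, centers, widths)
    td_error = target - value(s, weights, centers, widths)
    feats = rbf_features(s, centers, widths)
    # d(feature_i)/d(width_i) = feature_i * (s - c_i)^2 / width_i^3
    grads = [-td_error * wt * f * (s - c) ** 2 / w ** 3
             for wt, f, c, w in zip(weights, feats, centers, widths)]
    return td_error, grads


def update_widths(widths, grads, step_size, min_width=1e-3):
    """Gradient-descent step on the widths, floored to keep them positive."""
    return [max(min_width, w - step_size * g) for w, g in zip(widths, grads)]
```

In a full method the linear weights would be estimated separately (by the RC algorithm, according to the abstract); here they are simply passed in as given.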

Keywords: Policy evaluation; reinforcement learning; recursive least squares temporal difference with gradient correction; value function approximation

Network Effects and Multinetwork Sellers' Dynamic Pricing in the US Smartphone Market

MANAGEMENT SCIENCE, 2023, vol. 69, no. 6, pp. 3297-3318

Authors: Liu, Yue; Luo, Rong. Affiliations: Cent Univ Finance & Econ, Sch Int Trade & Econ, Beijing 102206, Peoples R China; Renmin Univ China, Sch Econ, Beijing 100872, Peoples R China

Although the literature on network effects has focused on single-network firms, many industries feature multinetwork firms that play more complex dynamic pricing games. In this paper, we estimate the network effect at the smartphone operating system (OS) level and study multi-OS telecommunication carriers' dynamic pricing strategies in an oligopolistic setting, using data on the U.S. smartphone industry. We find a positive OS network effect. Counterfactual analysis indicates that if the carriers were single-OS sellers, they would increase the phone prices of large OSs and lower the prices of small OSs, which reduces the consumer surplus by $6.99 billion. Further analyses show that the multi-OS carriers choose lower prices for large OSs than for small OSs because of their preference for OS concentration.

Keywords: network effect; multinetwork firms; dynamic pricing game; network concentration; value function approximation

Workforce Scheduling in the Era of Crowdsourced Delivery

TRANSPORTATION SCIENCE, 2020, vol. 54, no. 4, pp. 1113-1133

Authors: Ulmer, Marlin; Savelsbergh, Martin. Affiliations: Tech Univ Carolo Wilhelmina Braunschweig, Carl Friedrich Gauss Fak, D-38106 Braunschweig, Germany; Georgia Inst Technol, H Milton Stewart Sch Ind & Syst Engn, Atlanta, GA 30332, USA

Using crowdsourced delivery capacity, that is, individuals offering their vehicle and their time to perform deliveries, can allow companies to provide faster delivery options and more easily accommodate fluctuations in demand. However, because of the uncertainty associated with crowdsourced delivery capacity, ensuring service quality is more challenging. To prevent or mitigate any negative effects of the uncertainty associated with crowdsourced delivery capacity, companies may choose to also have a scheduled delivery workforce that they can control more effectively. We investigate continuous approximation and value function approximation methods for scheduling this workforce, that is, deciding their shifts (start time and duration) to achieve a service level target at minimum cost. An extensive computational study demonstrates the efficacy of our methods and provides insights into the use of crowdsourced delivery capacity.

Keywords: stochastic dynamic vehicle routing; workforce scheduling; crowdsourcing; continuous approximation; value function approximation
