Existing deep reinforcement learning (DRL) algorithms suffer from low sample efficiency. Episodic memory allows DRL algorithms to remember and reuse past experiences with high return, thereby improving sample efficiency. However, due to the high dimensionality of the state-action space in continuous action tasks, previous methods for such tasks typically only utilize the information stored in episodic memory, rather than directly employing episodic memory for action selection as is done in discrete action tasks. We posit that episodic memory retains the potential to guide action selection in continuous control tasks. Our objective is to enhance sample efficiency by leveraging episodic memory for action selection in such tasks: either reducing the number of training steps required to achieve comparable performance, or enabling the agent to obtain higher rewards within the same number of training steps. To this end, we propose an "Episodic Memory-Double Actor-Critic (EMDAC)" framework, which can use episodic memory for action selection in continuous action tasks. The critics and the episodic memory evaluate the value of the state-action pairs selected by the two actors to determine the final action. Meanwhile, we design an episodic memory based on a Kalman filter optimizer, which is updated using the episodic rewards of collected state-action pairs; the Kalman filter optimizer assigns different weights to experiences collected at different time periods during the memory update. In our episodic memory, state-action pair clusters serve as indices, recording both the occurrence frequency of these clusters and the value estimates of the corresponding state-action pairs, so that the value of a state-action pair cluster can be estimated by querying the episodic memory. After that, we design an intrinsic reward based on the novelty of state-action pairs with respect to the episodic memory, defined by the occurrence frequency of state-action pair clusters, to enhance the
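The cluster-indexed memory described above can be sketched as follows. This is a minimal illustration: a simple running-average update stands in for the paper's Kalman filter optimizer, and all names (`EpisodicMemory`, `intrinsic_reward`, `beta`) are illustrative, not the authors' implementation.

```python
from collections import defaultdict

class EpisodicMemory:
    """Sketch of a cluster-indexed episodic memory (illustrative only)."""

    def __init__(self):
        self.counts = defaultdict(int)    # occurrence frequency per cluster
        self.values = defaultdict(float)  # value estimate per cluster

    def update(self, cluster_id, episodic_return, lr=0.1):
        # Running-average update toward the observed episodic return
        # (the paper instead weights updates via a Kalman filter optimizer).
        self.counts[cluster_id] += 1
        self.values[cluster_id] += lr * (episodic_return - self.values[cluster_id])

    def value(self, cluster_id):
        # Query the memory for the cluster's current value estimate.
        return self.values[cluster_id]

    def intrinsic_reward(self, cluster_id, beta=1.0):
        # Novelty bonus: rarely visited clusters yield a larger bonus.
        return beta / (1 + self.counts[cluster_id]) ** 0.5

mem = EpisodicMemory()
mem.update("c0", 10.0)
mem.update("c0", 10.0)
```

Querying `intrinsic_reward` on an unseen cluster returns the maximum bonus, so novel regions of the state-action space are encouraged over frequently visited ones.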
The Flexible Pickup and Delivery Services Problem (FPDSP) arises from the practical needs of multi-warehouse management strategies and is one of the key challenges in the current urban distribution logistics industry. The problem aims to quickly compute route plans in complex scenarios so that the total traveling time of the vehicle is minimized while the time window requirements are met. To address this problem, we propose a deep reinforcement learning method based on the actor-critic algorithm to quickly compute approximate optimal solutions of the FPDSP. Specifically, we propose a Transformer Model with Parallel Encoders (TMPE). The model efficiently extracts order features through parallel encoders and then uses serial decoders to fuse the feature information and optimize the order selection process. In addition, we design a reward function that reduces the number of repeated pickups made by the vehicle at the same consignor's location across different orders, thereby effectively reducing the vehicle's total travel time. Experimental results show that, compared with heuristic methods on seven different datasets, our method can quickly find feasible solutions to the problem. Moreover, compared with all baseline methods, our method reaches the optimal solution in 14 cases, which significantly improves its problem-solving ability. This result provides a new approach for optimizing multi-warehouse pickup and delivery logistics in cities in the future.
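The repeated-pickup penalty in such a reward can be sketched with a minimal function; this is not the paper's exact formulation, and `travel_time` and `repeat_penalty` are assumed names.

```python
def route_reward(route, travel_time, repeat_penalty=5.0):
    """Illustrative reward: negative total travel time, minus a penalty
    for revisiting an already-visited pickup location.

    route: sequence of location ids visited in order.
    travel_time: dict mapping (a, b) -> travel minutes between locations.
    """
    total = sum(travel_time[(a, b)] for a, b in zip(route, route[1:]))
    repeats = len(route) - len(set(route))  # extra visits to seen locations
    return -total - repeat_penalty * repeats

# Toy instance: depot "D", consignor locations "A", "B", "C".
tt = {("D", "A"): 10, ("A", "B"): 5, ("B", "A"): 5, ("A", "C"): 7}
```

Under this shaping, a route that returns to consignor "A" between two orders pays both the extra travel time and the repeat penalty, steering the policy toward consolidated pickups.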
In this article, we explore an event-triggered optimal control problem for nonlinear networked control systems (NCSs) with input saturation and aperiodic intermittent control. First, a non-quadratic cost function with the property of intermittent control is formulated, and a Hamilton-Jacobi-Bellman (HJB) equation is derived from the given cost function to acquire the optimal control inputs. To avoid continuous-time communication over the network, a novel aperiodically intermittent dynamic event-triggered (AIDET) control scheme, integrating a dynamic event-triggered control scheme with an aperiodic intermittent control scheme, is proposed in this article. A piecewise-continuous internal dynamic variable is introduced into the event-triggering condition, which is more conducive to increasing inter-event times than static event-triggering schemes. Furthermore, the event-triggering condition designed in this article is rigorously proven to exclude Zeno behavior. Moreover, because the HJB equation is difficult to solve directly, an actor-critic algorithm within the AIDET scheme is proposed to approximate the optimal control inputs. The approximation errors of the weight vectors are proven to be uniformly ultimately bounded. The stability of the considered systems under the proposed AIDET control scheme is analyzed using Lyapunov theory. Finally, simulation examples are given to illustrate the effectiveness of the proposed actor-critic-based AIDET control scheme.
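A dynamic event-triggering test with an internal variable can be sketched as below. The thresholds `sigma`, `theta`, and `lam`, and the exact form of the condition, are illustrative assumptions rather than the article's formulation; the point is that the nonnegative internal variable `eta` adds slack beyond a static condition, lengthening inter-event times.

```python
import numpy as np

def dynamic_trigger(e, x, eta, sigma=0.5, theta=1.0):
    """Sketch of a dynamic event-triggering test.

    e: measurement error accumulated since the last trigger,
    x: current state, eta: internal dynamic variable.
    Trigger (return True) when the static margin plus the dynamic
    slack eta/theta is exhausted.
    """
    margin = sigma * np.dot(x, x) - np.dot(e, e)
    return margin + eta / theta < 0.0  # True => transmit and reset e

def eta_update(eta, e, x, dt=0.01, lam=1.0, sigma=0.5):
    # The internal variable evolves with the unused static margin;
    # clipping at zero keeps it nonnegative between events.
    d_eta = -lam * eta + sigma * np.dot(x, x) - np.dot(e, e)
    return max(0.0, eta + dt * d_eta)
```

With `eta = 0` this reduces to a static condition; a positive `eta` delays the next event, which is the mechanism the article credits for longer inter-event times.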
In this paper, a novel event-triggered control strategy is proposed for uncertain nonlinear systems by developing a fractional-order fuzzy sliding mode controller based on a fractional-order actor-critic network. The proposed approach offers several key features. First, a sigma-point Kalman filter is employed to accurately estimate unmeasured states. Second, a fractional-order sliding mode controller with an event-triggered mechanism is designed to achieve practical sliding mode control while preventing the Zeno phenomenon. Third, to reduce chattering in sliding mode control, a fractional-order actor-critic recurrent neural network is proposed, effectively approximating the switching control stage and enhancing system performance while reducing event triggers. The fractional-order actor-critic network incorporates fuzzy rules defined by a generalized Gaussian function with the Mittag-Leffler function, and a critic network approximates the value function, further enhancing performance. Parameter learning is guided by a fractional-order Gauss-Newton method. Stability analysis is performed using the Lyapunov method. Finally, the efficacy of the proposed method is demonstrated via experimental validation on a real inverted pendulum system.
We propose a novel actor-critic algorithm with guaranteed convergence to an optimal policy for a discounted reward Markov decision process. The actor incorporates a descent direction that is motivated by the solution of a certain non-linear optimization problem. We also discuss an extension to incorporate function approximation and demonstrate the practicality of our algorithms on a network routing application. (C) 2016 Elsevier B.V. All rights reserved.
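A generic tabular one-step actor-critic on a toy deterministic MDP illustrates the actor/critic interplay the abstract refers to; the paper's specific descent direction is not reproduced here, and the toy MDP and all names are assumptions for illustration.

```python
import numpy as np

def actor_critic(P, R, gamma=0.9, steps=2000, alpha=0.1, beta=0.05, seed=0):
    """Tabular one-step actor-critic for a discounted, deterministic MDP.

    P[s, a] -> next state, R[s, a] -> reward.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    V = np.zeros(n_states)                   # critic: state values
    theta = np.zeros((n_states, n_actions))  # actor: softmax preferences
    s = 0
    for _ in range(steps):
        logits = theta[s] - theta[s].max()
        probs = np.exp(logits) / np.exp(logits).sum()
        a = rng.choice(n_actions, p=probs)
        s_next, r = P[s, a], R[s, a]
        delta = r + gamma * V[s_next] - V[s]  # TD error
        V[s] += alpha * delta                 # critic update
        grad = -probs                         # d log pi(a|s) / d theta[s]
        grad[a] += 1.0
        theta[s] += beta * delta * grad       # actor update
        s = s_next
    return V, theta

# Toy MDP: action 1 always moves to state 1 and pays reward 1.
P = np.array([[0, 1], [0, 1]])
R = np.array([[0.0, 1.0], [0.0, 1.0]])
```

After training on this toy MDP, the actor's preferences favor the rewarding action and the critic's value for state 1 approaches its discounted return.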
Obtaining useful information accurately and quickly from massive amounts of text is one of the most pressing needs today. Automatic text summarization technology summarizes and condenses the given source ...
In recent years, deep graph neural networks (GNNs) have been used as solvers or helper functions for the traveling salesman problem (TSP), but they are usually employed as encoders that generate static node representations for downstream tasks and cannot capture the dynamic permutational information that arises as solutions are continually updated. To address this problem, we propose a permutational encoding graph attention encoder and attention-based decoder (PEG2A) model for the TSP, trained by the advantage actor-critic algorithm. In this work, the permutational encoding graph attention (PEGAT) network is designed to encode node embeddings, gathering information from neighbors while simultaneously obtaining the dynamic graph permutational information. The attention-based decoder is tailored to compute probability distributions over pairs of nodes picked for 2-opt moves. The experimental results show that our method outperforms the compared learning-based algorithms and traditional heuristic methods.
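The 2-opt move itself, which the decoder's chosen node pair parameterizes, can be sketched directly; the helper names are illustrative, and in the learned model the pair `(i, j)` would come from the attention decoder rather than be given explicitly.

```python
def two_opt_move(tour, i, j):
    """Apply a 2-opt move: reverse the segment tour[i:j+1].

    Reversing the segment replaces edges (tour[i-1], tour[i]) and
    (tour[j], tour[j+1]) with (tour[i-1], tour[j]) and (tour[i], tour[j+1]).
    """
    assert 0 < i < j < len(tour)  # keep the starting node fixed
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def tour_length(tour, dist):
    # Length of the closed tour under a distance matrix.
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))
```

Iterating such moves, with the learned policy scoring candidate pairs, is the solution-improvement loop the abstract describes.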
In the pursuit of ubiquitous broadband connectivity, there has been a significant shift towards the vertical expansion of communication networks into space, particularly through the exploitation of low Earth orbit (LEO) satellite constellations, which are favored for their relatively low latency. However, this approach faces many challenges that need to be addressed, including atmospheric turbulence, high path loss, and dynamic cloud formations. High-altitude pseudo-satellites (HAPS) have emerged as promising relaying layers between LEO satellites and ground stations, enhancing coverage, latency, and direct terrestrial user connectivity. While radio frequency (RF) bands suffer from congestion and limited bandwidth, free space optical (FSO) communications offer higher data rates but are susceptible to misalignment and weather-induced signal degradation. To address these challenges, a hybrid RF/FSO approach has been proposed to take advantage of both technologies by dynamically switching between RF and FSO based on propagation channel conditions. This paper introduces a reinforcement learning-based algorithm designed to optimize the trajectory of the HAPS, maneuvering around cloudy areas and seamlessly switching between the RF and FSO communication modes to maximize the achievable capacity. The proposed approach aims to maximize system performance by intelligently adapting to environmental conditions, offering a promising solution for next-generation space communication networks.
As one of the important complementary technologies of the fifth-generation (5G) wireless communication and beyond, mobile device-to-device (D2D) edge caching and computing can effectively reduce the pressure on backbone networks and improve the user experience. Specific content can be pre-cached on the user devices based on personalized content placement strategies, and the cached content can be fetched by neighboring devices in the same D2D network. However, when multiple devices simultaneously fetch content from the same device, collisions will occur and reduce communication efficiency. In this paper, we design the content fetching strategies based on an actor-critic deep reinforcement learning (DRL) architecture, which can adjust the content fetching collision rate to adapt to different application scenarios. First, the optimization problem is formulated with the goal of minimizing the collision rate to improve the throughput, and a general actor-critic DRL algorithm is used to improve the content fetching strategy. Second, by optimizing the network architecture and reward function, the two-level actor-critic algorithm is improved to effectively manage the collision rate and transmission power. Furthermore, to balance the conflict between the collision rate and device energy consumption, the related reward values are weighted in the reward function to optimize the energy efficiency. The simulation results show that the content fetching collision rate based on the improved two-level actor-critic algorithm decreases significantly compared with that of the baseline algorithms, and the network energy consumption can be optimized by adjusting the weight factors.
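The weighted trade-off between collision rate and energy consumption in the reward can be sketched as follows; the weights and the reward shape are illustrative assumptions, not the paper's exact formulation.

```python
def fetch_reward(collided, tx_power, w_collision=1.0, w_energy=0.2):
    """Illustrative per-step reward for a D2D content-fetching agent.

    collided: whether this fetch collided with another device's fetch.
    tx_power: normalized transmit power spent on the fetch.
    Raising w_collision favors throughput; raising w_energy favors
    energy efficiency, mirroring the weighted reward in the abstract.
    """
    r = 0.0 if collided else 1.0   # reward a successful fetch
    if collided:
        r -= w_collision           # penalize the collision
    r -= w_energy * tx_power       # penalize energy spent
    return r
```

Tuning `w_collision` against `w_energy` is the knob the abstract describes for adapting the collision rate to different application scenarios.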
This paper presents a partially model-free adaptive optimal control solution to the deterministic nonlinear discrete-time (DT) tracking control problem in the presence of input constraints. The tracking error dynamics and reference trajectory dynamics are first combined to form an augmented system. Then, a new discounted performance function based on the augmented system is presented for the optimal nonlinear tracking problem. In contrast to the standard solution, which finds the feedforward and feedback terms of the control input separately, the minimization of the proposed discounted performance function gives both feedback and feedforward parts of the control input simultaneously. This enables us to encode the input constraints into the optimization problem using a nonquadratic performance function. The DT tracking Bellman equation and tracking Hamilton-Jacobi-Bellman (HJB) are derived. An actor-critic-based reinforcement learning algorithm is used to learn the solution to the tracking HJB equation online without requiring knowledge of the system drift dynamics. That is, two neural networks (NNs), namely, actor NN and critic NN, are tuned online and simultaneously to generate the optimal bounded control policy. A simulation example is given to show the effectiveness of the proposed method.
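The nonquadratic performance function commonly used to encode a symmetric input constraint |u| < lam is U(u) = 2 * integral from 0 to u of lam * atanh(v/lam) dv, which has the closed form coded below. This is a sketch of the standard penalty from the constrained-input optimal control literature, not necessarily the paper's exact choice.

```python
import numpy as np

def nonquadratic_cost(u, lam=1.0):
    """Closed form of the standard constrained-input penalty:
    U(u) = 2*lam*u*atanh(u/lam) + lam**2 * ln(1 - (u/lam)**2).

    U(0) = 0, U is nonnegative and increasing in |u|, and its derivative
    2*lam*atanh(u/lam) diverges as |u| -> lam, which is what keeps the
    resulting optimal policy (a tanh of the value gradient) inside the bound.
    """
    v = u / lam
    return 2 * lam * u * np.arctanh(v) + lam**2 * np.log(1 - v**2)
```

Because the minimizing control takes the form u = -lam * tanh(.), the saturation bound is respected by construction rather than enforced by clipping.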