In the dynamic field of deep reinforcement learning, the self-attention mechanism has been increasingly recognized. Nevertheless, its application in discrete problem domains has been relatively limited, presenting complex optimization challenges. This article introduces a pioneering deep reinforcement learning algorithm, termed Attention-based Actor-Critic with Priority Experience Replay (A2CPER). A2CPER combines the strengths of self-attention mechanisms with the actor-critic framework and prioritized experience replay to enhance policy formulation for discrete problems. The algorithm's architecture features dual networks within the actor-critic model: the actor formulates action policies and the critic evaluates state values to judge the quality of policies. The incorporation of target networks aids in stabilizing network optimization. Moreover, the addition of self-attention mechanisms bolsters the policy network's capability to focus on critical information, while priority experience replay promotes training stability and reduces correlation among training samples. Empirical experiments on discrete action problems validate A2CPER's adeptness at policy optimization, marking significant performance improvements across tasks. In summary, A2CPER highlights the viability of self-attention mechanisms in reinforcement learning, presenting a robust framework for discrete problem-solving and potential applicability in complex decision-making scenarios.
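As a concrete reference point, here is a minimal sketch of the prioritized-replay component named in the abstract, following the common proportional scheme (priority proportional to |TD error|^alpha, with importance-sampling correction). The abstract does not state A2CPER's exact priority metric or hyperparameters, so alpha and beta below are illustrative assumptions.

```python
# Minimal sketch of proportional prioritized experience replay; A2CPER's
# exact priority metric and hyperparameters are assumptions, not the paper's.
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6, beta=0.4):
        self.capacity, self.alpha, self.beta = capacity, alpha, beta
        self.buffer, self.priorities, self.pos = [], np.zeros(capacity), 0

    def add(self, transition, td_error):
        # Higher TD error -> higher sampling priority.
        priority = (abs(td_error) + 1e-6) ** self.alpha
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        p = self.priorities[:len(self.buffer)]
        probs = p / p.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights correct the non-uniform sampling bias.
        weights = (len(self.buffer) * probs[idx]) ** (-self.beta)
        weights /= weights.max()
        return [self.buffer[i] for i in idx], idx, weights
```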
Markov games provide a powerful framework for modeling strategic multi-agent interactions in dynamic environments. Traditionally, convergence properties of decentralized learning algorithms in these settings have been established only for special cases, such as Markov zero-sum and potential games, which do not fully capture real-world interactions. In this letter, we address this gap by studying the asymptotic properties of learning algorithms in general-sum Markov games. In particular, we focus on a decentralized algorithm where each agent adopts an actor-critic learning dynamic with asynchronous step sizes. This decentralized approach enables agents to operate independently, without requiring knowledge of others' strategies or payoffs. We introduce the concept of a Markov Near-Potential Function (MNPF) and demonstrate that it serves as an approximate Lyapunov function for the policy updates in the decentralized learning dynamics, which allows us to characterize the convergent set of strategies. We further strengthen this result under specific regularity conditions and in settings with a finite number of Nash equilibria.
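For readers unfamiliar with near-potential ideas, the following is a plausible form of the MNPF condition, adapted from the near-potential game literature; the letter's precise definition may differ in detail.

```latex
% Sketch of a near-potential condition: \Phi is the MNPF, V_i the value of
% agent i under joint policy (\pi_i, \pi_{-i}), and \kappa \ge 0 the gap.
\[
\bigl|\,[V_i(\pi_i', \pi_{-i}) - V_i(\pi_i, \pi_{-i})]
      - [\Phi(\pi_i', \pi_{-i}) - \Phi(\pi_i, \pi_{-i})]\,\bigr| \le \kappa
\quad \forall i,\ \pi_i,\ \pi_i',\ \pi_{-i}.
\]
```

When kappa = 0 this reduces to an exact Markov potential game; a small kappa is what allows Phi to act as an approximate Lyapunov function for the policy updates.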
In this paper, an online optimization approach for a fractional-order PID controller based on a fractional-order actor-critic algorithm (FOPID-FOAC) is proposed. The proposed FOPID-FOAC scheme exploits the advantages of the FOPID controller and the FOAC approach to improve the performance of nonlinear systems. The proposed FOAC is built by developing a FO-based learning approach for the actor-critic neural network with adaptive learning rates. Moreover, a FO rectified linear unit (RLU) is introduced to enable the AC neural network to define and optimize its own activation function. By means of the Lyapunov theorem, the convergence and stability of the proposed algorithm are analyzed. The FO operators for the FOAC learning algorithm are obtained using the gray wolf optimization (GWO) algorithm. The effectiveness of the proposed approach is demonstrated through extensive simulations on the tracking problem of the two-degrees-of-freedom (2-DOF) helicopter system and the stabilization problem of the inverted pendulum (IP) system. Moreover, the performance of the proposed algorithm is compared against optimized FOPID control approaches under different system conditions, namely when the system is subjected to parameter uncertainties and external disturbances. The comparison is conducted in terms of two types of performance indices: error performance indices and time-response performance indices. The first type includes the integral absolute error (IAE) and the integral squared error (ISE), whereas the second includes the rise time, the maximum overshoot (Max. OS), and the settling time. The simulation results explicitly indicate the high effectiveness of the proposed FOPID-FOAC controller in terms of both types of performance measures under different scenarios compared with the other control algorithms.
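To make the FOPID structure concrete, here is a minimal sketch of one PI^lambda-D^mu control step using the Gruenwald-Letnikov approximation of the fractional operators. The gains and orders are placeholders, not the values FOPID-FOAC tunes online.

```python
# Discrete fractional-order PID step via the Gruenwald-Letnikov expansion;
# gains (Kp, Ki, Kd) and orders (lam, mu) are illustrative placeholders.
import numpy as np

def gl_weights(alpha, n):
    """Gruenwald-Letnikov binomial weights w_j for D^alpha, j = 0..n-1."""
    w = np.empty(n)
    w[0] = 1.0
    for j in range(1, n):
        w[j] = w[j - 1] * (1.0 - (alpha + 1.0) / j)
    return w

def fopid_output(err_hist, h, Kp=1.0, Ki=0.5, Kd=0.1, lam=0.9, mu=0.8):
    """err_hist: error samples, newest first; h: sampling period in seconds."""
    e = np.asarray(err_hist, dtype=float)
    n = len(e)
    integral = h ** lam * (gl_weights(-lam, n) @ e)    # D^{-lambda} e(t)
    derivative = h ** -mu * (gl_weights(mu, n) @ e)    # D^{mu} e(t)
    return Kp * e[0] + Ki * integral + Kd * derivative
```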
We propose a method to automatically select proper values of the three thresholds in the Canny edge algorithm. Edge detection is widely used for object recognition, detection, and segmentation. Owing to its good performance, the Canny edge algorithm remains widely used among the many edge detection algorithms, but it requires manually selecting three appropriate thresholds for a given image. Some approaches have been proposed for automatically setting thresholds in the Canny edge algorithm, but they either handle only a subset of the three thresholds or demonstrate their performance only over a limited range of variation. In natural scenes, images are acquired under various illumination, pose, and weather conditions. This paper proposes a method that can operate in such diverse environments. We formulate the given problem with an actor-critic algorithm and propose actor and critic networks to solve it. We also suggest a reward configuration based on an edge evaluation network, together with a measure to prevent reversal of the high and low thresholds: the edge evaluation network takes an original image and an edge image as input, and a negative reward is assigned when a reversal of the high and low thresholds occurs. The proposed algorithm can adapt to unseen environments using images without requiring ground-truth labels. Experimental results on diverse datasets show the feasibility of the proposed algorithm.
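The reversal penalty can be made concrete with a short sketch. The specific threshold set (low, high, Gaussian sigma) and the evaluation-network scoring callable edge_quality are assumptions read from the abstract, not the authors' code.

```python
# Sketch of the reward logic: penalize threshold reversal, otherwise score
# the resulting edge map with an (assumed) learned edge-quality function.
import cv2

def apply_and_reward(image, low, high, sigma, edge_quality):
    """edge_quality: callable (image, edge_map) -> float score."""
    if high <= low:                       # threshold reversal -> penalty
        return None, -1.0
    blurred = cv2.GaussianBlur(image, (0, 0), sigma)  # ksize derived from sigma
    edges = cv2.Canny(blurred, low, high)
    return edges, edge_quality(image, edges)
```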
Vibration signals can be used to extract effective fault features for fault diagnosis. However, traditional supervised learning requires considerable manpower and time to label samples manually, and this process is difficult to apply to practical fault diagnosis. Deep reinforcement learning, which combines the perception ability of deep learning with the decision-making ability of reinforcement learning, can independently extract hidden fault features and effectively improve the accuracy of fault diagnosis. Semi-supervised learning can reduce the proportion of labeled samples to decrease the learning cost while improving recognition accuracy with unlabeled samples. In this study, we propose a novel semi-supervised deep reinforcement learning method: a semi-supervised generative adversarial network combined with an improved actor-critic algorithm performs fault diagnosis when the labeled sample size is small. In the rolling-bearing fault experiment and engineering application, three-channel time-frequency graphs extracted from the raw signals with the wavelet packet transform are compressed into single-channel gray graphs. Then, to simulate datasets with few labeled samples, labeled ratios of 2%, 5%, 20%, 50%, and 100% are created by removing labels from a portion of the processed samples. The results of the proposed method and other intelligent methods demonstrate that the proposed method provides better performance than the others in compound fault diagnosis, even when the labeled sample size is small.
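A minimal sketch of the label-masking step used to simulate scarce labels at the stated ratios; the split logic is a straightforward reading of the abstract rather than the authors' implementation.

```python
# Simulate semi-supervised splits: keep labels for a random fraction of
# samples and mark the rest -1 (fed to the unsupervised GAN branch).
import numpy as np

def mask_labels(y, labeled_ratio, rng=None):
    rng = rng or np.random.default_rng(0)
    y_semi = np.full_like(y, -1)
    keep = rng.choice(len(y), size=int(labeled_ratio * len(y)), replace=False)
    y_semi[keep] = y[keep]
    return y_semi

y = np.arange(1000) % 4                     # toy 4-class labels
splits = {r: mask_labels(y, r) for r in (0.02, 0.05, 0.20, 0.50, 1.00)}
```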
Earth observation satellite (EOS) scheduling involves complex constraints and is difficult to solve. For the multi-orbit scheduling problem of a single EOS, this paper first describes the EOS scheduling process and then models it as a Markov decision process. The actor-critic (AC) algorithm is used to construct and train the scheduling decision-making model, and multi-step estimation is used to calculate the advantage function. The experimental results demonstrate the satisfactory efficiency, effectiveness, and generalization ability of the suggested approach.
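The multi-step advantage estimate mentioned above follows the standard n-step form A_t = sum_{k<n} gamma^k r_{t+k} + gamma^n V(s_{t+n}) - V(s_t); a minimal sketch follows, with horizon n and gamma as placeholders rather than the paper's settings.

```python
# n-step advantage estimation for an actor-critic update.
def n_step_advantage(rewards, values, gamma=0.99, n=5):
    """rewards[t] = r_t; values[t] = V(s_t), with one extra bootstrap entry
    so len(values) == len(rewards) + 1."""
    T = len(rewards)
    advantages = []
    for t in range(T):
        horizon = min(n, T - t)
        ret = sum(gamma ** k * rewards[t + k] for k in range(horizon))
        ret += gamma ** horizon * values[t + horizon]   # bootstrap tail
        advantages.append(ret - values[t])
    return advantages
```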
Deep learning is used for decision making and functional control in various fields, such as autonomous systems. However, rather than being developed by logical design, deep learning models train themselves from learning data. Moreover, only reward values are used to evaluate performance, which does not provide enough information to confirm that the model has learned properly. This paper proposes a new method to assess the correctness of reinforcement learning by considering other properties of the learning algorithm. The proposed method is applied to the evaluation of actor-critic algorithms, and correctness-related insights into the algorithm are confirmed through experiments.
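The abstract does not name the additional properties used. Purely as an illustration, a correctness check might track value consistency (critic TD error) and policy entropy alongside the reward, as in this hypothetical sketch.

```python
# Hypothetical correctness signals for an actor-critic agent, logged
# alongside reward: TD-error magnitude and policy entropy.
import numpy as np

def correctness_metrics(rewards, values, next_values, action_probs, gamma=0.99):
    td = np.asarray(rewards) + gamma * np.asarray(next_values) - np.asarray(values)
    p = np.clip(np.asarray(action_probs), 1e-8, 1.0)     # (T, num_actions)
    entropy = -(p * np.log(p)).sum(axis=-1).mean()
    return {"mean_reward": float(np.mean(rewards)),
            "td_error_rms": float(np.sqrt((td ** 2).mean())),
            "policy_entropy": float(entropy)}
```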
The quality of the fault recognition stage is one of the key factors affecting the efficiency of intelligent manufacturing. Many excellent achievements in deep learning (DL) have recently been realized as methods of fault recognition. However, DL models have inherent shortcomings. In particular, the phenomenon of over-fitting or degradation suggests that such an intelligent algorithm cannot fully use its feature perception ability. Researchers have mainly adapted the network architecture for fault diagnosis, without taking the above limitations into account. In this study, we propose a novel deep reinforcement learning method that combines the perception of DL with the decision-making ability of reinforcement learning. This method enhances the classification accuracy of the DL module by autonomously learning much more of the knowledge hidden in raw data. The proposed method, based on a convolutional neural network (CNN), also adopts an improved actor-critic algorithm for fault recognition. The important parts of the standard actor-critic algorithm, such as the environment, neural network, reward, and loss functions, have been fully reconsidered in the improved actor-critic algorithm. Additionally, to fully distinguish compound faults under heavy background noise, multi-channel signals are first stacked synchronously and then input into the model in an end-to-end training mode. The diagnostic results on the compound fault of the bearing and tool in a machine-tool experimental system show that, compared with other methods, the proposed network structure achieves more accurate results. These findings demonstrate that, under the guidance of the improved actor-critic algorithm and the processing method for multi-channel data, the proposed method has stronger exploration performance.
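A minimal sketch of the synchronous multi-channel stacking step, assuming channels-first CNN input; the shapes and batching are assumptions, not the paper's pipeline.

```python
# Stack synchronized sensor channels along the channel axis so the CNN
# sees one multi-channel sample per time window.
import numpy as np

def stack_channels(signals):
    """signals: list of (T,) arrays from synchronized sensors -> (C, T)."""
    return np.stack(signals, axis=0)

sample = stack_channels([np.random.randn(2048) for _ in range(3)])  # (3, 2048)
```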
Existing deep reinforcement learning (DRL) algorithms suffer from low sample efficiency. Episodic memory allows DRL algorithms to remember and reuse past experiences with high return, thereby improving sample efficiency. However, due to the high dimensionality of the state-action space in continuous action tasks, previous methods for such tasks often only utilize the information stored in episodic memory, rather than directly employing episodic memory for action selection as is done in discrete action tasks. We suppose that episodic memory retains the potential to guide action selection in continuous control tasks. Our objective is to enhance sample efficiency by leveraging episodic memory for action selection in such tasks: either reducing the number of training steps required to achieve comparable performance, or enabling the agent to obtain higher rewards within the same number of training steps. To this end, we propose an "Episodic Memory-Double Actor-Critic (EMDAC)" framework, which can use episodic memory for action selection in continuous action tasks. The critics and the episodic memory evaluate the value of state-action pairs selected by the two actors to determine the final action. Meanwhile, we design an episodic memory based on a Kalman filter optimizer, which updates using the episodic rewards of collected state-action pairs. The Kalman filter optimizer assigns different weights to experiences collected at different time periods during the memory update process. In our episodic memory, state-action pair clusters are used as indices, recording both the occurrence frequency of these clusters and the value estimates for the corresponding state-action pairs. This enables the estimation of the value of state-action pair clusters by querying the episodic memory. After that, we design an intrinsic reward based on the novelty of state-action pairs with respect to the episodic memory, defined by the occurrence frequency of state-action pair clusters, to enhance the…
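A minimal sketch of a cluster-indexed episodic memory with a count-based novelty bonus; the grid clustering rule, the bonus form beta/sqrt(N), and the filter-style value update are assumptions consistent with the abstract, not the authors' code.

```python
# Cluster-indexed episodic memory: counts give a novelty-based intrinsic
# reward; values are blended with new episodic returns (filter-style update).
import numpy as np
from collections import defaultdict

class EpisodicMemory:
    def __init__(self, bin_width=0.5, beta=0.1):
        self.bin_width, self.beta = bin_width, beta
        self.count = defaultdict(int)     # cluster -> occurrence frequency
        self.value = defaultdict(float)   # cluster -> value estimate

    def _cluster(self, state, action):
        sa = np.concatenate([state, action]) / self.bin_width
        return tuple(np.floor(sa).astype(int))   # coarse grid "cluster" index

    def update(self, state, action, episodic_return, gain=0.1):
        c = self._cluster(state, action)
        self.count[c] += 1
        # Blend the old estimate with the new return (assumed update rule).
        self.value[c] += gain * (episodic_return - self.value[c])

    def intrinsic_reward(self, state, action):
        c = self._cluster(state, action)
        return self.beta / np.sqrt(self.count[c] + 1)  # rarer cluster -> bigger bonus
```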
As one of the important complementary technologies of fifth-generation (5G) wireless communication and beyond, mobile device-to-device (D2D) edge caching and computing can effectively reduce the pressure on backbone networks and improve the user experience. Specific content can be pre-cached on user devices based on personalized content placement strategies, and the cached content can be fetched by neighboring devices in the same D2D network. However, when multiple devices simultaneously fetch content from the same device, collisions occur and reduce communication efficiency. In this paper, we design content fetching strategies based on an actor-critic deep reinforcement learning (DRL) architecture, which can adjust the content fetching collision rate to adapt to different application scenarios. First, the optimization problem is formulated with the goal of minimizing the collision rate to improve throughput, and a general actor-critic DRL algorithm is used to improve the content fetching strategy. Second, by optimizing the network architecture and the reward function, an improved two-level actor-critic algorithm is developed to effectively manage the collision rate and transmission power. Furthermore, to balance the conflict between the collision rate and device energy consumption, the related reward terms are weighted in the reward function to optimize energy efficiency. The simulation results show that the content fetching collision rate of the improved two-level actor-critic algorithm decreases significantly compared with that of the baseline algorithms, and the network energy consumption can be optimized by adjusting the weight factors.
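The weighted trade-off in the reward function can be sketched as follows; the reward terms and weight values are illustrative assumptions rather than the paper's formulation.

```python
# Weighted reward balancing throughput against collisions and energy use;
# w_c and w_e are the tunable weight factors mentioned in the abstract.
def fetching_reward(throughput, collided, energy, w_c=1.0, w_e=0.2):
    """Reward successful fetching, penalize collisions and energy cost."""
    return throughput - w_c * float(collided) - w_e * energy
```

Raising w_e relative to w_c shifts the learned strategy toward energy efficiency at the cost of a higher tolerated collision rate, which matches the abstract's claim that energy consumption is optimized by adjusting the weight factors.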