Authors:
Ren, Jineng
Wenzhou Univ, Chashan Univ Town, Sch Comp Sci & Artificial Intelligence, Wenzhou 325035, Zhejiang, Peoples R China
Wenzhou Univ, Chashan Univ Town, Artificial Intelligence & Adv Mfg Inst Yongjia, Wenzhou 325035, Zhejiang, Peoples R China
This paper proposes a gradient-based multi-agent actor-critic algorithm for off-policy reinforcement learning using importance sampling. Our algorithm is incremental with full gradients, and its complexity per iteration scales linearly with the size of approximation features. Previous multi-agent actor-critic algorithms are limited to the on-policy setting or to off-policy emphatic temporal difference (TD) learning, and they do not take advantage of the advances in off-policy gradient temporal difference (GTD) learning. As a theoretical contribution, we establish that the critic step of the proposed algorithm converges to the TD solution of the projected Bellman equation and the actor step converges to the set of asymptotically stable fixed points. Numerical experiments on the multi-agent generalization of Boyan's chain problem show that the proposed approach provides improved performance in terms of stability and convergence rate compared with the state-of-the-art baseline algorithm.
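As a rough single-agent illustration of the gradient-TD family this abstract refers to, the sketch below applies one importance-weighted TDC update on linear features. It is not the paper's multi-agent algorithm: the function name `tdc_update`, the step sizes `alpha` and `beta`, and the discount `gamma` are assumptions chosen for illustration. Each step touches every feature once, so the cost is linear in the feature dimension.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def tdc_update(theta, w, phi, phi_next, reward, rho,
               gamma=0.9, alpha=0.05, beta=0.1):
    """One importance-weighted TDC step (hypothetical helper, not from the paper).

    theta    : value-function weights (linear critic)
    w        : auxiliary weights used for the gradient correction
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance-sampling ratio pi(a|s) / b(a|s)
    """
    # TD error under the current linear value estimate
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    # Estimate of the expected TD error given the current features
    correction = dot(phi, w)
    # Main weights: TD step plus gradient-correction term
    new_theta = [t + alpha * rho * (delta * p - gamma * correction * pn)
                 for t, p, pn in zip(theta, phi, phi_next)]
    # Auxiliary weights: track the TD error projected on the features
    new_w = [wi + beta * rho * (delta - correction) * p
             for wi, p in zip(w, phi)]
    return new_theta, new_w, delta
```

With `rho = 0` (the target policy would never take the sampled action) both weight vectors are left unchanged, which is the expected behaviour of an importance-weighted update.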
ISBN: (print) 9798350333398
To improve spectrum efficiency in emergency communications, a dynamic spectrum sharing (DSS) scheme based on federated learning (FL) and deep reinforcement learning (DRL) is proposed. The operation model follows the paradigm of cognitive radio networks (CRNs), in which multiple secondary users (SUs) with different bandwidth requirements and different spectrum sensing and access capabilities randomly access idle frequency bands that primary users (PUs) do not occupy. Different users in emergency communications are treated as SUs or PUs according to their communication priorities. A maximum entropy based multi-agent actor-critic (ME-MAAC) algorithm is used to realize an optimal spectrum sharing strategy by updating varying rewards to SUs. During the learning process, the FL algorithm is used to assign appropriate weights to SUs. Simulation results show that the proposed scheme performs better in terms of reward value, access rate, and convergence speed.
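For orientation, maximum entropy reinforcement learning is commonly written as an entropy-regularized return. The abstract does not give the exact ME-MAAC objective, so the form and symbols below (policy $\pi$, discount $\gamma$, temperature $\alpha$, reward $r_t$, state $s_t$) are assumptions based on the standard formulation:

```latex
J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}
  \Big( r_t + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\sum_{a} \pi(a \mid s) \log \pi(a \mid s)
```

A larger $\alpha$ favors more exploratory channel-access policies, while $\alpha = 0$ recovers the usual expected discounted return.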
The widespread use of market-making algorithms in electronic over-the-counter markets may give rise to unexpected effects resulting from the autonomous learning dynamics of these algorithms. In particular, the possibility of "tacit collusion" among market makers has increasingly received regulatory scrutiny. We model the interaction of market makers in a dealer market as a stochastic differential game of intensity control with partial information and study the resulting dynamics of bid-ask spreads. Competition among dealers is modeled as a Nash equilibrium, while collusion is described in terms of Pareto optima. Using a decentralized multi-agent deep reinforcement learning algorithm to model how competing market makers learn to adjust their quotes, we show that the interaction of market-making algorithms via market prices, without any sharing of information, may give rise to tacit collusion, with spread levels strictly above the competitive equilibrium level.