ISBN (print): 9781479945528
A significant problem facing researchers in reinforcement learning, and particularly in multi-objective learning, is the dearth of good benchmarks. In this paper, we present a method and software tool enabling the creation of random problem instances, including multi-objective learning problems, with specific structural properties. This tool, called Merlin (for Multi-objective Environments for Reinforcement Learning), provides the ability to control these features in predictable ways, thus allowing researchers to begin to build a more detailed understanding of which features of a problem interact with a given learning algorithm to improve or degrade the algorithm's performance. We present this method and tool, and briefly discuss the controls provided by the generator, its supported options, and their implications for the generated benchmark instances.
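The abstract does not give Merlin's actual interface; the sketch below is a hypothetical illustration of the underlying idea of generating a random multi-objective MDP instance with controllable structural properties (state count, action count, number of objectives, branching factor). All names and parameters are illustrative assumptions, not the tool's API.

```python
# Hypothetical sketch (not the actual Merlin API): generate a random
# multi-objective MDP with a controllable number of states, actions,
# reward objectives, and successor-branching factor.
import numpy as np

def random_mo_mdp(n_states=50, n_actions=4, n_objectives=2, branching=3, seed=0):
    """Return (transitions, rewards) for a random multi-objective MDP.

    transitions[s, a] is a probability vector over successor states with
    at most `branching` nonzero entries; rewards[s, a] is a vector of
    `n_objectives` rewards drawn uniformly from [0, 1].
    """
    rng = np.random.default_rng(seed)
    transitions = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            succ = rng.choice(n_states, size=branching, replace=False)
            transitions[s, a, succ] = rng.dirichlet(np.ones(branching))
    rewards = rng.uniform(size=(n_states, n_actions, n_objectives))
    return transitions, rewards

T, R = random_mo_mdp()
print(T.shape, R.shape)  # (50, 4, 50) (50, 4, 2)
```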
ISBN (print): 9781479945528
This paper proposes a methodology to estimate the maximum revenue that can be generated by a company that operates a high-capacity storage device to buy or sell electricity on the day-ahead electricity market. The methodology exploits the dynamic programming (DP) principle and is specified for hydrogen-based storage devices that use electrolysis to produce hydrogen and fuel cells to generate electricity from hydrogen. Experimental results are generated using historical data of energy prices on the Belgian market. They show how the storage capacity and other parameters of the storage device influence the optimal revenue. The main conclusion drawn from the experiments is that it may be advisable to invest in large storage tanks to exploit the inter-seasonal price fluctuations of electricity.
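As a rough illustration of the dynamic-programming principle the methodology relies on, the sketch below runs backward induction over a discretized storage level against an hourly price series. The efficiencies, power limit, grid resolution, and prices are toy assumptions, not the paper's hydrogen-device model or Belgian market data.

```python
# Illustrative backward-induction DP for storage arbitrage on a
# day-ahead price series (all parameter values are assumptions).
import numpy as np

def max_revenue(prices, levels=51, p_max=1.0, eta_in=0.7, eta_out=0.5):
    """prices: hourly day-ahead prices; levels: number of discretized storage levels."""
    soc = np.linspace(0.0, 1.0, levels)   # normalized state of charge
    V = np.zeros(levels)                  # terminal value: leftover storage worth nothing
    for price in reversed(prices):        # backward induction over hours
        V_new = np.full(levels, -np.inf)
        for i, s in enumerate(soc):
            for j, s_next in enumerate(soc):
                delta = s_next - s        # >0 charge (buy), <0 discharge (sell)
                if delta > 0:             # buy electricity, store via electrolysis
                    energy = delta / eta_in
                    cash = -price * energy
                else:                     # sell electricity produced by the fuel cell
                    energy = -delta * eta_out
                    cash = price * energy
                if energy > p_max:        # per-hour power limit
                    continue
                V_new[i] = max(V_new[i], cash + V[j])
        V = V_new
    return V[0]                           # start from an empty storage tank

prices = [40.0, 35.0, 80.0, 60.0, 20.0, 90.0]
print(round(max_revenue(prices), 2))
```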
ISBN (print): 9781479945528
Briefly, the main purpose of the paper is fourfold: a) Cognitive perception, which consists of two functional blocks: improved sparse coding under the influence of perceptual attention for extracting relevant information from the observables and ignoring irrelevant information, followed by a Bayesian algorithm for state estimation. b) The entropic state of the perceptor, which provides feedback information to the controller. c) Cognitive control, which also consists of two functional blocks: an executive learning algorithm computed by processing the entropic state, followed by predictive planning to set the stage for the policy to act on the environment, thereby establishing the global perception-action cycle. d) Experimental results on exploiting perceptual as well as executive attention in a cooperative manner, aimed at the first demonstration of risk control in the presence of a severe disturbance in the environment.
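As a loose illustration only: one plausible reading of the entropic state is the Shannon entropy of the perceptor's posterior state estimate, which is what the short helper below computes; the exact definition used in the paper may differ, so treat this as an assumption.

```python
# Assumed form of the entropic state: Shannon entropy of the posterior
# produced by the perceptor's Bayesian state estimator, fed back to the
# cognitive controller (lower entropy = better perception).
import numpy as np

def entropic_state(posterior, eps=1e-12):
    p = np.asarray(posterior, dtype=float)
    p = p / p.sum()                       # normalize to a probability vector
    return float(-(p * np.log(p + eps)).sum())

print(entropic_state([0.7, 0.2, 0.1]))
```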
ISBN (print): 9781479945528
Decentralized partially observable Markov decision processes (Dec-POMDPs) model cooperative multiagent scenarios, providing a powerful general framework for team-based artificial intelligence. While optimal algorithms exist for Dec-POMDPs, theoretical and empirical results demonstrate that they are impractical for many problems of real interest. We examine the use of reinforcement learning (RL) as a means to generate adequate, if not optimal, joint policies for Dec-POMDPs. It is easily demonstrated (and expected) that single-agent RL produces results of little joint utility. We therefore investigate heuristic methods, based upon the dynamics of the Dec-POMDP formulation, that bias the learning process to produce coordinated action. Empirical tests on a benchmark problem show that these heuristics significantly enhance learning performance, even outperforming a hand-crafted heuristic in cases where the learning process converges quickly.
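For orientation, the baseline the paper improves upon is independent Q-learning: each agent learns over its own local observations while receiving the shared joint reward, with no explicit coordination. The class below is a bare-bones sketch of that baseline, not the authors' heuristics, which would further bias the update toward coordinated action.

```python
# Minimal independent Q-learner for a Dec-POMDP agent (illustrative sketch).
import random
from collections import defaultdict

class IndependentQLearner:
    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.Q = defaultdict(float)       # Q over (local observation, action) pairs
        self.actions, self.alpha, self.gamma, self.epsilon = actions, alpha, gamma, epsilon

    def act(self, obs):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(obs, a)])

    def update(self, obs, action, joint_reward, next_obs):
        # Each agent updates on the shared joint reward it observes.
        best_next = max(self.Q[(next_obs, a)] for a in self.actions)
        target = joint_reward + self.gamma * best_next
        self.Q[(obs, action)] += self.alpha * (target - self.Q[(obs, action)])
```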
ISBN (print): 9781479945528
A common complaint about reinforcement learning (RL) is that it is too slow to learn a value function that gives good performance. This issue is exacerbated in continuous state spaces. This paper presents a straightforward approach to speeding up, and even improving, RL solutions by reusing features learned during a pre-training phase prior to Q-learning. During pre-training, the agent is taught to predict the state change given a state/action pair. The effect of pre-training is examined using the model-free Q-learning approach but could readily be applied to a number of RL approaches, including model-based RL. The analysis of the results provides ample evidence that the features learned during pre-training are the reason behind the improved RL performance.
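A hedged sketch of the pre-training idea follows: a network is first fit to predict the state change for a (state, action) pair, and its hidden-layer activations are then reused as fixed features for a linear Q-function approximator. The toy dynamics, network size, and use of scikit-learn are assumptions for illustration, not the paper's setup.

```python
# Pre-training sketch: learn to predict state change, reuse hidden features.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic transition data: 2-D continuous state, 1-D action (assumed toy dynamics).
states = rng.uniform(-1, 1, size=(5000, 2))
actions = rng.uniform(-1, 1, size=(5000, 1))
deltas = 0.1 * actions + 0.05 * np.sin(3 * states)

# Pre-training phase: predict the state change from (state, action).
X = np.hstack([states, actions])
model = MLPRegressor(hidden_layer_sizes=(32,), activation="relu",
                     max_iter=500, random_state=0)
model.fit(X, deltas)

def features(state, action):
    """Hidden-layer activations of the pre-trained network."""
    x = np.concatenate([state, action])[None, :]
    return np.maximum(0.0, x @ model.coefs_[0] + model.intercepts_[0]).ravel()

# Q-learning would then use w @ features(s, a) as Q(s, a); only w is learned.
w = np.zeros(32)
print(features(np.array([0.2, -0.4]), np.array([0.5])).shape)  # (32,)
```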
ISBN (print): 9781479945528
As more renewable, yet volatile, forms of energy like solar and wind are being incorporated into the grid, the problem of finding optimal control policies for energy storage is becoming increasingly important. These sequential decision problems are often modeled as stochastic dynamic programs, but when the state space becomes large, traditional (exact) techniques such as backward induction, policy iteration, or value iteration quickly become computationally intractable. Approximate dynamic programming (ADP) thus becomes a natural solution technique for solving these problems to near-optimality using significantly fewer computational resources. In this paper, we compare the performance of the following: various approximation architectures with approximate policy iteration (API), approximate value iteration (AVI) with a structured lookup table, and direct policy search on a benchmarked energy storage problem (i.e., one for which the optimal solution is computable).
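Of the three solution families compared, direct policy search is the simplest to sketch: the toy example below grid-searches a two-threshold buy/sell storage policy by simulation. The price model, capacity, charge rate, and threshold grids are illustrative assumptions, not the benchmarked problem from the paper.

```python
# Direct policy search over a two-threshold storage policy (toy example).
import numpy as np

def simulate(prices, buy_below, sell_above, capacity=4.0, rate=1.0):
    """Revenue of a policy that buys below one price threshold and sells above another."""
    level, revenue = 0.0, 0.0
    for p in prices:
        if p <= buy_below and level < capacity:
            q = min(rate, capacity - level)
            level += q
            revenue -= p * q
        elif p >= sell_above and level > 0.0:
            q = min(rate, level)
            level -= q
            revenue += p * q
    return revenue

rng = np.random.default_rng(1)
prices = 50 + 20 * np.sin(np.arange(200) / 6.0) + rng.normal(0, 5, 200)

# Exhaustive search over the two policy parameters.
best = max(((simulate(prices, lo, hi), lo, hi)
            for lo in range(30, 55, 5)
            for hi in range(55, 85, 5)), key=lambda t: t[0])
print(f"revenue={best[0]:.1f} buy_below={best[1]} sell_above={best[2]}")
```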
ISBN (print): 9781479945528
This paper presents a novel stochastic event-based near-optimal control strategy to regulate a networked control system (NCS) represented as an uncertain nonlinear continuous-time system. An online stochastic actor-critic neural network (NN) based approach is utilized to achieve near-optimal regulation in the presence of network constraints, such as network-induced time-varying delays and random packet losses, under event-based transmission of the feedback signals. The transformed discrete-time nonlinear NCS obtained after incorporating the delays and packet losses is utilized for the actor-critic NN based controller design. To relax the need for knowledge of the control coefficient matrix, an NN based identifier is used. The event-sampled state vector is utilized as the NN input, and the respective weights are updated aperiodically at the occurrence of events. Further, an event-trigger condition is designed using the Lyapunov technique to ensure ultimate boundedness of all closed-loop signals and to save network resources and computation. Moreover, policy and value iterations are not utilized for the stochastic optimal regulator design. Finally, the analytical design is verified on a numerical example by carrying out Monte Carlo simulations.
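The sketch below illustrates the event-based transmission idea in its simplest form: the state is transmitted, and the controller would be updated, only when the gap to the last transmitted state exceeds a threshold, with the control held constant between events. The relative-threshold condition and the linear toy plant are assumptions; the paper derives its trigger condition via a Lyapunov analysis for an uncertain nonlinear NCS.

```python
# Minimal event-triggered control loop (assumed threshold condition).
import numpy as np

def run_event_triggered(dynamics, controller, x0, steps=100, sigma=0.1):
    x, x_event = x0.copy(), x0.copy()
    events = 0
    for _ in range(steps):
        if np.linalg.norm(x - x_event) > sigma * np.linalg.norm(x):
            x_event = x.copy()      # event: transmit the state over the network
            events += 1             # controller/NN weight updates occur only here
        u = controller(x_event)     # control input held between events
        x = dynamics(x, u)
    return events

A = np.array([[1.0, 0.1], [0.0, 0.98]])
B = np.array([[0.0], [0.1]])
K = np.array([[0.5, 1.0]])
events = run_event_triggered(lambda x, u: A @ x + B @ u,
                             lambda xe: -K @ xe,
                             np.array([[1.0], [0.0]]))
print("events:", events)
```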
ISBN (print): 9781479945528
This paper describes conditions for convergence to optimal values of the dynamic programming algorithm applied to total-cost Markov Decision Processes (MDPs) with Borel state and action sets and with possibly unbounded one-step cost functions. It also studies applications of these results to Partially Observable MDPs (POMDPs). It is well known that POMDPs can be reduced to special MDPs, called Completely Observable MDPs (COMDPs), whose state spaces are sets of probability distributions over the original states. This paper describes conditions on POMDPs under which optimal policies for COMDPs can be found by value iteration. In other words, this paper provides sufficient conditions for solving total-cost POMDPs with infinite state, observation, and action sets by dynamic programming. Examples of applications to filtering, identification, and inventory control are provided.
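To make the POMDP-to-COMDP reduction concrete, the sketch below runs value iteration on a discretized belief space for a toy two-state, two-action, two-observation model with a discounted total-cost criterion. All model numbers are illustrative, and the finite belief grid sidesteps the measure-theoretic conditions that are the paper's actual subject.

```python
# Value iteration on the belief space (COMDP) of a toy POMDP.
import numpy as np

# P[a, s, s'] transition kernels, Z[a, s', o] observation kernels,
# C[s, a] one-step costs (all toy numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
Z = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.5, 0.5]]])
C = np.array([[0.0, 2.0], [5.0, 2.0]])
gamma = 0.9

beliefs = np.linspace(0.0, 1.0, 101)      # belief = Pr(state 0)

def belief_update(b, a, o):
    """Bayes update of the belief after action a and observation o."""
    prior = np.array([b, 1.0 - b]) @ P[a]
    post = prior * Z[a, :, o]
    return post[0] / post.sum() if post.sum() > 0 else b

V = np.zeros_like(beliefs)
for _ in range(200):                      # value iteration over belief grid points
    V_new = np.empty_like(V)
    for i, b in enumerate(beliefs):
        q = []
        for a in (0, 1):
            cost = b * C[0, a] + (1 - b) * C[1, a]
            prior = np.array([b, 1.0 - b]) @ P[a]
            for o in (0, 1):
                p_o = prior @ Z[a, :, o]
                j = int(round(belief_update(b, a, o) * 100))   # nearest grid point
                cost += gamma * p_o * V[j]
            q.append(cost)
        V_new[i] = min(q)                 # cost criterion: take the minimizing action
    V = V_new
print(V[50])                              # value at the uniform belief
```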
ISBN (print): 9781479945528
In embedded control systems, the control input is computed from sensing data of the plant in a processor, and there is a delay, called the computation time delay, due to the computation and the data transmission. When we design an optimal controller, we need to take this delay into account to achieve optimality. Moreover, in the case where it is difficult to identify a mathematical model of the plant, a model-free approach is useful. In particular, the reinforcement learning-based approach has attracted much attention in the design of adaptive optimal controllers. In this paper, we assume that the plant is a linear system but that its parameters are unknown. We then apply reinforcement learning to the design of an adaptive optimal digital controller that takes the computation time delay into consideration. First, we consider the case where all states of the plant are observed and it takes L time steps to update the control input. An optimal feedback gain is learned from sequences of pairs of the state and the control input. Next, we consider the case where the control input is determined from outputs of the plant. We cannot use an observer to estimate the state of the plant, since its parameters are unknown, so we use a data-based control approach for the estimation. Finally, we apply the proposed adaptive optimal controller to attitude control of a quadrotor in the hovering state and show its effectiveness by simulation.
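One standard way to account for an input delay of L steps, shown below as an assumption for illustration rather than the paper's derivation, is to augment the state with the last L control inputs so that the delayed plant becomes an ordinary delay-free linear system z[k+1] = Az z[k] + Bz u[k], to which LQR-style Q-learning can then be applied.

```python
# Delay augmentation: z = [x; u[k-L]; ...; u[k-1]] turns an L-step input
# delay into a delay-free linear system (illustrative construction).
import numpy as np

def augment_for_delay(A, B, L):
    n, m = B.shape
    Az = np.zeros((n + L * m, n + L * m))
    Bz = np.zeros((n + L * m, m))
    Az[:n, :n] = A
    Az[:n, n:n + m] = B                      # the input applied now is u[k-L]
    for i in range(L - 1):                   # shift the stored inputs forward
        Az[n + i * m: n + (i + 1) * m,
           n + (i + 1) * m: n + (i + 2) * m] = np.eye(m)
    Bz[n + (L - 1) * m:, :] = np.eye(m)      # the newly computed input enters last
    return Az, Bz

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Az, Bz = augment_for_delay(A, B, L=2)
print(Az.shape, Bz.shape)                    # (4, 4) (4, 1)
```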
ISBN (print): 9781479945528
Utility theory has served as a bedrock for modeling risk in economics. Where risk is involved in decision-making, the exponential utility (EU) function has been used in the literature as an objective function for capturing risk-averse behavior when solving Markov decision processes (MDPs) via utility theory. The EU framework uses a so-called risk-averseness coefficient (RAC) that seeks to quantify the risk appetite of the decision-maker. Unfortunately, as we show in this paper, the EU framework suffers from computational deficiencies that prevent it from being useful in practice for solution methods based on reinforcement learning (RL). In particular, the value function becomes very large and typically overflows the computer's numerical representation. We provide a simple example to demonstrate this. Further, we show empirically how a variance-adjusted (VA) approach, which approximates the EU objective for reasonable values of the RAC, can be used in the RL algorithm. The VA framework in a sense has two objectives: maximize expected returns and minimize variance. We conduct empirical studies of a VA-based RL algorithm on the semi-MDP (SMDP), which is a more general version of the MDP. We conclude with a mathematical proof of the boundedness of the iterates in our algorithm.
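A small numerical illustration of the overflow issue, and of the variance-adjusted alternative, is given below; the reward distribution, RAC value, and variance weight are toy assumptions, not figures from the paper.

```python
# Toy demonstration: exponential utility of a long-run cumulative reward
# overflows, while a variance-adjusted objective stays bounded.
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(loc=10.0, scale=3.0, size=1000)   # per-step rewards
total = returns.cumsum()                               # cumulative reward over time

theta = 0.1                                  # risk-averseness coefficient (RAC)
with np.errstate(over="ignore"):
    eu = np.exp(theta * total)               # exponential utility of cumulative reward
print("EU after 1000 steps:", eu[-1])        # inf: the value overflows

k = 0.5                                      # weight on the variance penalty
va = returns.mean() - k * returns.var()      # variance-adjusted objective
print("VA objective:", round(va, 3))
```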