Consider an agent who faces a sequential decision problem. At each stage the agent takes an action and observes a stochastic outcome (e.g., daily prices, weather conditions, opponents' actions in a repeated game, etc.). The agent's stage-utility depends on his action, the observed outcome, and previous outcomes. We assume the agent is Bayesian and is endowed with a subjective belief over the distribution of outcomes. The agent's initial belief is typically inaccurate. Therefore, his subjectively optimal strategy is initially suboptimal. As time passes, information about the true dynamics is accumulated and, depending on the compatibility of the belief with the truth, the agent may eventually learn to optimize. We introduce the notion of relative entropy, which is a natural adaptation of the entropy of a stochastic process to the subjective set-up. We present conditions, expressed in terms of relative entropy, that determine whether the agent will eventually learn to optimize. It is shown that low entropy yields asymptotically optimal behavior. In addition, we present a notion of pointwise merging and link it with relative entropy. (C) 2000 Elsevier Science S.A. All rights reserved.
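For readers who want the flavor of the central quantity, the following is a schematic rendering of a relative-entropy rate between the true law and the subjective belief; the notation is ours and is only meant to suggest the kind of object the abstract describes, not the paper's exact definition.

```latex
% Schematic only: \mu is the true law of the outcome process, \tilde{\mu} the
% agent's subjective belief, and h_n an n-stage history of outcomes.
\[
  d(\mu \,\|\, \tilde{\mu})
  \;=\; \limsup_{n \to \infty} \frac{1}{n}\,
        \mathbb{E}_{\mu}\!\left[\, \log \frac{\mu(h_n)}{\tilde{\mu}(h_n)} \,\right].
\]
% "Low entropy yields asymptotically optimal behavior" then corresponds to
% conditions under which a rate of this kind vanishes.
```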
Value-function approximation is investigated for the solution via Dynamic Programming (DP) of continuous-state sequential N-stage decision problems, in which the reward to be maximized has an additive structure over a finite number of stages. Conditions that guarantee smoothness properties of the value function at each stage are derived. These properties are exploited to approximate such functions by means of certain nonlinear approximation schemes, which include splines of suitable order and Gaussian radial-basis networks with variable centers and widths. The accuracies of suboptimal solutions obtained by combining DP with these approximation tools are estimated. The results provide insights into the successful performances reported in the literature on the use of value-function approximators in DP. The theoretical analysis is applied to a problem of optimal consumption, with simulation results illustrating the use of the proposed solution methodology. Numerical comparisons with classical linear approximators are presented.
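As a rough illustration of this methodology (fitting each stage's value function with Gaussian radial-basis functions inside a backward DP recursion), a minimal sketch follows. The consumption-style dynamics, reward, grids, and the fixed centers and width are assumptions for illustration only; the paper allows variable centers and widths, which would require a nonlinear fit rather than the least squares used here.

```python
# Sketch of approximate dynamic programming for an N-stage continuous-state
# problem: stage value functions are fitted with Gaussian radial-basis functions.
import numpy as np

def rbf_features(x, centers, width):
    """Gaussian RBF features for 1-D states x (centers/width fixed here)."""
    return np.exp(-((np.asarray(x)[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))

def fit_value(x, v, centers, width):
    """Least-squares fit of RBF weights to sampled values v at states x."""
    w, *_ = np.linalg.lstsq(rbf_features(x, centers, width), v, rcond=None)
    return w

def eval_value(x, w, centers, width):
    return rbf_features(x, centers, width) @ w

# Illustrative additive-reward consumption problem: maximize sum_t g(x_t, a_t)
# with (deterministic, for simplicity) dynamics x_{t+1} = f(x_t, a_t).
f = lambda x, a: np.clip(x - a, 0.0, 1.0)      # next wealth
g = lambda x, a: np.sqrt(np.maximum(a, 0.0))   # stage utility of consumption a
N = 5                                          # number of stages
states = np.linspace(0.0, 1.0, 50)             # sampled states
actions = np.linspace(0.0, 1.0, 50)            # candidate consumption levels
centers, width = np.linspace(0.0, 1.0, 12), 0.1

weights = [None] * (N + 1)                     # weights[t] approximates J_t
weights[N] = fit_value(states, np.zeros_like(states), centers, width)  # J_N = 0

for t in range(N - 1, -1, -1):                 # backward recursion over stages
    targets = np.empty_like(states)
    for i, x in enumerate(states):
        feasible = actions[actions <= x]       # cannot consume more than wealth
        nxt = f(x, feasible)
        q = g(x, feasible) + eval_value(nxt, weights[t + 1], centers, width)
        targets[i] = q.max()
    weights[t] = fit_value(states, targets, centers, width)

print("approximate J_0(1.0) =", eval_value([1.0], weights[0], centers, width)[0])
```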
This paper describes a new mathematical programming approach to sequential decision problems that have an underlying decision tree structure. The approach, based upon a characterization of strategies as extreme points...
The price demand relation is a fundamental concept that models how price affects the sale of a product. It is critical to have an accurate estimate of its parameters, as it will impact the company's revenue. The learning has to be performed very efficiently using a small window of a few test points, because of the rapid changes in price demand parameters due to seasonality and fluctuations. However, there are conflicting goals when seeking the two objectives of revenue maximization and demand learning, known as the learn/earn trade-off. This is akin to the exploration/exploitation trade-off that we encounter in machine learning and optimization algorithms. In this paper, we consider the problem of price demand function estimation, taking into account its exploration-exploitation characteristic. We design a new objective function that combines both aspects. This objective function is essentially the revenue minus a term that measures the error in parameter estimates. Recursive algorithms that optimize this objective function are derived. The proposed method outperforms other existing approaches.
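A minimal sketch of this kind of combined objective (expected revenue minus a weighted estimation-error term) with a recursive least-squares update is given below, assuming a linear demand model. The weight lambda_, the price grid, and the demand parameters are illustrative assumptions, not the paper's specification.

```python
# Sketch of a "revenue minus estimation-error" pricing rule with a recursive
# least-squares (RLS) update, assuming linear demand d = a - b*p + noise.
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true, noise_sd = 10.0, 1.5, 0.5   # unknown "true" demand parameters
theta = np.array([8.0, 1.0])                 # current estimate of (a, b)
P = np.eye(2) * 10.0                         # RLS covariance of the estimate
lambda_ = 2.0                                # weight on the learning term
prices = np.linspace(0.5, 6.0, 200)          # candidate prices

for t in range(25):
    # Expected revenue under the current estimate, for each candidate price.
    revenue = prices * (theta[0] - theta[1] * prices)
    # Learning term: trace of the RLS covariance *after* testing each price
    # (smaller is better, so informative prices are rewarded).
    phi = np.stack([np.ones_like(prices), -prices])   # regressors per price
    Pphi = P @ phi
    gain = np.sum(phi * Pphi, axis=0)                 # phi' P phi
    quad2 = np.sum(Pphi * Pphi, axis=0)               # phi' P^2 phi
    post_trace = np.trace(P) - quad2 / (1.0 + gain)   # trace of updated covariance
    objective = revenue - lambda_ * post_trace        # combined objective
    p = prices[np.argmax(objective)]

    # Observe demand at the chosen price and update the estimate recursively.
    d = a_true - b_true * p + rng.normal(0.0, noise_sd)
    x = np.array([1.0, -p])
    k = P @ x / (1.0 + x @ P @ x)                     # RLS gain
    theta = theta + k * (d - x @ theta)
    P = P - np.outer(k, x @ P)

print("estimated (a, b):", theta)
```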
Price experimentation is an important tool for firms to find the optimal selling price of their products. It should be conducted properly, since experimenting with selling prices can be costly. A firm, therefore, needs to find a pricing policy that optimally balances between learning the optimal price and gaining revenue. In this paper, we propose such a pricing policy, called controlled variance pricing (CVP). The key idea of the policy is to enhance the certainty equivalent pricing policy with a taboo interval around the average of previously chosen prices. The width of the taboo interval shrinks at an appropriate rate as the amount of data gathered gets large; this guarantees sufficient price dispersion. For a large class of demand models, we show that this procedure is strongly consistent, which means that eventually the value of the optimal price will be learned, and derive upper bounds on the regret, which is the expected amount of money lost due to not using the optimal price. Numerical tests indicate that CVP performs well on different demand models and time scales.
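The taboo-interval mechanism lends itself to a short sketch. Everything concrete below (linear demand, the c0 constant, the t**(-0.25) shrink rate, the price bounds) is an assumption for illustration; the paper derives the appropriate shrink rate and the consistency and regret guarantees.

```python
# Sketch of the CVP "taboo interval" idea: price at the certainty-equivalent
# optimum, but if that price falls inside a shrinking interval around the
# average of past prices, push it to the nearest endpoint so that prices stay
# sufficiently dispersed.
import numpy as np

def cvp_price(theta, past_prices, t, c0=0.5, p_min=0.1, p_max=10.0):
    a, b = theta                                  # current estimate of d = a - b*p
    p_ce = np.clip(a / (2.0 * b), p_min, p_max)   # certainty-equivalent optimal price
    if not past_prices:
        return float(p_ce)
    center = np.mean(past_prices)                 # average of previously chosen prices
    half_width = c0 * t ** (-0.25)                # taboo interval shrinks with t
    lo, hi = center - half_width, center + half_width
    if lo < p_ce < hi:                            # inside the taboo interval:
        p_ce = lo if p_ce - lo < hi - p_ce else hi  # move to the nearest endpoint
    return float(np.clip(p_ce, p_min, p_max))

# Example: price chosen at period t = 5 given an estimate and a price history.
print(cvp_price((10.0, 1.5), [3.0, 3.2, 3.1, 3.3], t=5))
```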
We consider sequential decision problems over an infinite horizon. The forecast or solution horizon approach to solving such problems requires that the optimal initial decision be unique. We show that multiple optimal initial decisions can exist in general and refer to their existence as degeneracy. We then present a conceptual cost perturbation algorithm for resolving degeneracy and identifying a forecast horizon. We also present a general near-optimal forecast horizon.
ISBN:
(Print) 9783319615813; 9783319615806
This paper raises the question of solving multi-criteria sequential decision problems under uncertainty. It proposes to extend to possibilistic decision trees the decision rules presented in [1] for non-sequential problems. It presents a series of algorithms for this new framework: Dynamic Programming can be used and provides an optimal strategy for rules that satisfy the property of monotonicity. There is no guarantee of optimality for those that do not, hence the definition of dedicated algorithms. The paper concludes with an empirical comparison of the algorithms.
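To make the dynamic-programming part concrete, here is a single-criterion sketch of backward induction on a possibilistic decision tree under the qualitative optimistic utility, one rule for which monotonicity is known to hold. The tree encoding and the numbers are ours; the paper's multi-criteria rules and its dedicated algorithms for non-monotonic rules are not reproduced here.

```python
# Backward induction on a toy possibilistic decision tree using the qualitative
# optimistic utility u_opt = max over leaves of min(possibility, utility).

def evaluate(node):
    """Return (value, chosen-action map) for a node of the tree."""
    kind = node["kind"]
    if kind == "leaf":
        return node["utility"], {}
    if kind == "chance":
        # Optimistic utility: max over outcomes of min(possibility, subtree value).
        best, plan = 0.0, {}
        for poss, child in node["children"]:
            v, p = evaluate(child)
            plan.update(p)
            best = max(best, min(poss, v))
        return best, plan
    if kind == "decision":
        # Choose the action whose subtree has the highest value.
        best_v, best_a, best_p = -1.0, None, {}
        for action, child in node["children"].items():
            v, p = evaluate(child)
            if v > best_v:
                best_v, best_a, best_p = v, action, p
        best_p[node["name"]] = best_a
        return best_v, best_p
    raise ValueError(kind)

tree = {"kind": "decision", "name": "D0", "children": {
    "invest": {"kind": "chance", "children": [
        (1.0, {"kind": "leaf", "utility": 0.3}),
        (0.6, {"kind": "leaf", "utility": 0.9})]},
    "wait": {"kind": "leaf", "utility": 0.5}}}

print(evaluate(tree))   # -> (0.6, {'D0': 'invest'})
```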
ISBN:
(Digital) 9783319131917
ISBN:
(Print) 9783319131917; 9783319131900
We develop a new formalism for solving team Markov decision processes (MDPs), called marginal-contribution stochastic games (MCSGs). In MCSGs, each agent's utility for a state transition is given by its marginal contribution to the team value function, so that utilities differ between agents and sparse interaction between them is naturally exploited. We prove that an MCSG admits a potential function and show that the locally optimal solutions, including the global optimum, correspond to the Nash equilibria of the game. We go on to show that any Nash equilibrium of a dynamic resource allocation problem with monotone submodular resource functions in MCSG form has a price of anarchy greater than 1/2. Finally, we characterize a class of distributed algorithms for MCSGs.
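A static, one-shot illustration of the marginal-contribution construction may help: each agent's utility is the team value minus the team value with that agent's action replaced by a fixed null action, and the team value then behaves as an exact potential for unilateral deviations. The toy team_value function and action sets below are assumptions; MCSGs proper are stochastic games with state transitions, which this sketch omits.

```python
# Marginal-contribution utilities in a two-agent, one-shot toy example.
import itertools

actions = [0, 1]            # each agent picks 0 ("idle") or 1 ("work")
NULL = 0                    # the fixed null action used as the baseline

def team_value(joint):
    # Toy team value with diminishing returns in the number of workers.
    return [0.0, 1.0, 1.5][sum(joint)]

def marginal_utility(i, joint):
    """Agent i's utility = its marginal contribution to the team value."""
    baseline = list(joint)
    baseline[i] = NULL
    return team_value(joint) - team_value(tuple(baseline))

# Check the potential property: a unilateral deviation changes an agent's
# utility by exactly the change in the team value.
for joint in itertools.product(actions, repeat=2):
    for i in range(2):
        for dev in actions:
            new = list(joint); new[i] = dev
            du = marginal_utility(i, tuple(new)) - marginal_utility(i, joint)
            dv = team_value(tuple(new)) - team_value(joint)
            assert abs(du - dv) < 1e-12
print("team value is an exact potential for the marginal-contribution utilities")
```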
SAMUEL is an experimental learning system that uses genetic algorithms and other learning methods to evolve reactive decision rules from simulations of multiagent environments. The basic approach is to explore a range...
Sutton’s Dyna framework provides a novel and computationally appealing way to integrate learning, planning, and reacting in autonomous agents. Examined here is a class of strategies designed to enhance the learning a...