This paper presents a new methodology to solve a general model of dynamic decision making with a continuous unknown parameter or state. The methodology centers on the "continuation-value functions" (mappings from the parameter space to the continuation-value space) created by feasible continuation policies. When the model primitives can be described through a family of basis functions (e.g., polynomials), a continuation-value function retains that property and can be represented by a basis weight vector. The set of efficient basis weight vectors can be constructed through backward induction, which significantly reduces problem complexity and enables an exact solution for small problems. A set of approximation methods based on the new methodology is developed to tackle larger problems. The methodology is also extended to the multidimensional (multiparameter) setting, which includes the problem of contextual multiarmed bandits with linear expected rewards. The approximation algorithm developed in this paper outperforms three benchmark algorithms (epsilon-greedy, Thompson sampling, and LinUCB) in learning situations with many actions and short horizons.
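To make the idea of basis weight vectors concrete, the following is a minimal sketch on a toy problem that is not taken from the paper. It assumes a two-action Bernoulli bandit: a risky action pays 1 with unknown probability theta, and a safe action pays a known constant c. Under these assumptions, the continuation value of any policy is a polynomial in theta, so it can be stored as a coefficient vector. The names `prune`, `backward_step`, `efficient_vectors`, and `best_value` are hypothetical, and the grid-based dominance check is a simple stand-in for whatever efficiency test the paper actually uses.

```python
import itertools
import numpy as np
from numpy.polynomial import polynomial as P

GRID = np.linspace(0.0, 1.0, 101)  # theta values used for the (approximate) dominance check
TOL = 1e-12                        # numerical tolerance for comparisons


def prune(vectors):
    """Keep only weight vectors whose polynomial is not pointwise dominated on GRID."""
    values = [P.polyval(GRID, w) for w in vectors]
    keep = []
    for i, vi in enumerate(values):
        dominated = False
        for j, vj in enumerate(values):
            if j == i or not np.all(vj >= vi - TOL):
                continue
            # Strictly better somewhere, or a duplicate that appeared earlier.
            if np.any(vj > vi + TOL) or j < i:
                dominated = True
                break
        if not dominated:
            keep.append(vectors[i])
    return keep


def backward_step(next_set, c):
    """One backward-induction step for the toy bandit (an assumed model)."""
    theta = np.array([0.0, 1.0])       # the polynomial theta
    one_minus = np.array([1.0, -1.0])  # the polynomial 1 - theta
    candidates = []
    # Safe action: collect c, observe nothing, follow any continuation policy.
    for w in next_set:
        candidates.append(P.polyadd([c], w))
    # Risky action: the continuation policy may depend on success or failure.
    for w_succ, w_fail in itertools.product(next_set, repeat=2):
        succ = P.polymul(theta, P.polyadd([1.0], w_succ))
        fail = P.polymul(one_minus, w_fail)
        candidates.append(P.polyadd(succ, fail))
    return prune(candidates)


def efficient_vectors(horizon, c):
    """Build the non-dominated weight vectors by backward induction."""
    current = [np.array([0.0])]  # terminal continuation value is zero
    for _ in range(horizon):
        current = backward_step(current, c)
    return current


def best_value(vectors, theta):
    """Upper envelope of the stored polynomials at a given theta."""
    return max(float(P.polyval(theta, w)) for w in vectors)
```

For example, `efficient_vectors(2, 0.5)` returns the surviving coefficient vectors (lowest degree first) for a two-period problem, and `best_value` evaluates their upper envelope at a given theta. The pruning step is what keeps the set small: without it, the number of candidate vectors grows roughly quadratically at every stage.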