ISBN (print): 9781479945528
In the stochastic multi-objective multi-armed bandit (MOMAB), arms generate a vector of stochastic rewards, one per objective, instead of a single scalar reward. As a result, there is not a single optimal arm but a set of optimal arms (the Pareto front) under the Pareto dominance relation, and there is a trade-off between finding the set of optimal arms (exploration) and selecting the optimal arms fairly, i.e. evenly (exploitation). To trade off exploration and exploitation, either the Pareto knowledge gradient (Pareto-KG for short) or the Pareto upper confidence bound (Pareto-UCB1 for short) can be used; they combine the KG policy and the UCB1 policy, respectively, with the Pareto dominance relation. In this paper, we propose Pareto Thompson sampling, which uses the Pareto dominance relation to find the Pareto front. We also propose the annealing-Pareto algorithm, which trades off exploration and exploitation by combining a decaying parameter epsilon(t) with the Pareto dominance relation: the decaying parameter drives exploration of the Pareto optimal arms, while the Pareto dominance relation is used to exploit the Pareto front. We experimentally compare Pareto-KG, Pareto-UCB1, Pareto Thompson sampling, and the annealing-Pareto algorithm on multi-objective Bernoulli distribution problems and conclude that annealing-Pareto is the best performing algorithm.
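As a rough illustration only (not the paper's reference implementation), the sketch below shows how the Pareto dominance relation can drive arm selection on a multi-objective Bernoulli bandit: one step of a Pareto Thompson sampling rule (Beta posterior samples, then a uniform pull from the Pareto front of the sampled reward vectors) and one step of an annealing-Pareto-style rule with a decaying epsilon(t). The helper names, the specific decay schedule eps0 / (1 + decay * t), and the toy problem values are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dominates(u, v):
    """True if reward vector u Pareto-dominates v (>= in every objective, > in at least one)."""
    return np.all(u >= v) and np.any(u > v)

def pareto_front(means):
    """Indices of arms whose mean-reward vectors are not dominated by any other arm."""
    return [i for i, m in enumerate(means)
            if not any(dominates(other, m) for j, other in enumerate(means) if j != i)]

def pareto_thompson_step(successes, failures):
    """One round of a Pareto Thompson sampling rule: sample a Beta posterior mean per
    arm and objective, then pull an arm chosen uniformly from the sampled Pareto front."""
    theta = rng.beta(successes + 1, failures + 1)          # shape (K, D)
    return rng.choice(pareto_front(theta))

def annealing_pareto_step(successes, failures, t, eps0=1.0, decay=0.01):
    """One round of an annealing-Pareto-style rule: with probability epsilon(t) explore a
    uniformly random arm, otherwise exploit an arm from the empirical Pareto front."""
    eps_t = eps0 / (1.0 + decay * t)                       # hypothetical decay schedule
    K = successes.shape[0]
    if rng.random() < eps_t:
        return rng.integers(K)
    means = successes / np.maximum(successes + failures, 1)
    return rng.choice(pareto_front(means))

# Tiny usage example: a 3-arm, 2-objective Bernoulli problem with illustrative means.
true_means = np.array([[0.8, 0.2], [0.2, 0.8], [0.5, 0.5]])
K, D = true_means.shape
S, F = np.zeros((K, D)), np.zeros((K, D))
for t in range(1000):
    a = pareto_thompson_step(S, F)                         # or annealing_pareto_step(S, F, t)
    reward = (rng.random(D) < true_means[a]).astype(float)
    S[a] += reward
    F[a] += 1 - reward
```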