The classical Upper Confidence Bound (UCB) algorithm constructs optimistic overestimates of the true means of the reward distributions, based on the sample mean and the number of times each arm has been chosen, in order to decide the best arm. In this setting, the variances of the reward distributions, essential components of any distribution, are ignored. However, in real-world applications, arms with relatively high means and small variances are sometimes preferable to arms with the highest mean and a large variance. Such concerns are treated as risk aversion in the Multi-Armed Bandits problem. Moreover, since the combination of the variance and the mean helps estimate the range of a distribution, proper use of the sample variance can yield a tighter upper confidence bound for the UCB algorithm, leading to a smaller regret. In this paper, the author investigates and summarizes current algorithms for risk aversion based on estimating and using variances to construct the upper confidence bound in the UCB algorithm.
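As a sketch of the idea surveyed here (not any one paper's exact algorithm), the snippet below contrasts the classical UCB1 index with a variance-aware index in the spirit of UCB-V; the function names and the reward-range parameter `b` are illustrative assumptions.

```python
import math

def ucb1_index(mean, n_pulls, t):
    """Classical UCB1 index: sample mean plus an exploration bonus
    that depends only on pull counts, not on variance."""
    return mean + math.sqrt(2 * math.log(t) / n_pulls)

def ucb_v_index(mean, variance, n_pulls, t, b=1.0):
    """Variance-aware index in the spirit of UCB-V: the main bonus
    term shrinks when the empirical variance is small, so low-variance
    arms receive a tighter upper confidence bound.
    `b` is an assumed upper bound on the reward range."""
    return (mean
            + math.sqrt(2 * variance * math.log(t) / n_pulls)
            + 3 * b * math.log(t) / n_pulls)
```

With the same mean and pull count, the variance-aware index is smaller for a low-variance arm than for a high-variance one, which is exactly the tightening effect the abstract describes.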
We consider the problem of cumulative reward maximization in multi-armed bandits. We address the security concerns that occur when data and computations are outsourced to an honest-but-curious cloud, i.e., one that executes tasks dutifully but tries to gain as much information as possible. We consider situations where the data used in bandit algorithms is sensitive and has to be protected, e.g., commercial or personal data. We rely on cryptographic schemes and propose UCB-MS, a secure multi-party protocol based on the UCB algorithm. We prove that UCB-MS computes the same cumulative reward as UCB while satisfying desirable security properties. In particular, cloud nodes cannot learn the cumulative reward or the sum of rewards for more than one arm. Moreover, by analyzing messages exchanged among cloud nodes, an external observer cannot learn the cumulative reward or the sum of rewards produced by any arm. We show that the overhead due to cryptographic primitives is linear in the size of the input. Our implementation confirms the linear-time behavior and the practical feasibility of our protocol, on both synthetic and real-world data.
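The UCB-MS protocol itself is not reproduced here; as a hedged illustration of one cryptographic building block commonly used in such secure-sum settings, additive secret sharing lets each party hold only a random-looking share of a per-arm reward sum, so no single node learns the value. The modulus and function names are illustrative assumptions.

```python
import random

PRIME = 2**61 - 1  # illustrative modulus for additive sharing

def share(secret, n_nodes):
    """Split `secret` into n_nodes additive shares modulo PRIME;
    any proper subset of shares reveals nothing about the secret."""
    shares = [random.randrange(PRIME) for _ in range(n_nodes - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Only a party holding all shares recovers the true sum."""
    return sum(shares) % PRIME
```

Shares can be updated locally as rewards arrive (addition commutes with sharing), which is why secure-sum primitives fit cumulative-reward computations.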
ISBN: (Print) 9781728107356
In this paper, we discuss the signal integrity of the printed circuit board (PCB) and put forward a new method of designing PCB parameters based on Monte-Carlo tree search (MCTS). First, the effects and measurements of signal integrity are analysed and presented. Next, the paper introduces how the MCTS algorithm conducts parameter assignment and presents an improved upper confidence bound (UCB) algorithm adapted to PCB parameter design. We simulate and collect eye-diagram metrics in software and accelerate the process of designing a signal-integrity-compliant PCB by employing MCTS. Compared to randomly searching for an optimal parameter combination, the method produces a globally optimal result rather than a suboptimum one. The total time cost is within ten minutes, while the random search takes hours to get a result.
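Combining MCTS with a UCB selection rule, as described above, is commonly known as UCT. A minimal sketch of that selection step (the statistics layout and parameter names are assumptions, not the paper's code) might look like:

```python
import math

def uct_select(children, total_visits, c=math.sqrt(2)):
    """Pick the child maximizing the UCB score used in MCTS (UCT).
    `children` maps a candidate parameter assignment to a tuple
    (visits, total_value) accumulated from earlier simulations."""
    def score(stats):
        visits, value = stats
        if visits == 0:
            return float("inf")  # always try unvisited assignments first
        # average simulated quality + exploration bonus
        return value / visits + c * math.sqrt(math.log(total_visits) / visits)
    return max(children, key=lambda k: score(children[k]))
```

In a PCB-design search, each child could stand for one parameter choice (e.g. a hypothetical trace width), with `total_value` aggregating simulated eye-metric scores.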
The principle of optimism in the face of uncertainty is known as a heuristic in sequential decision-making problems. The Overtaking method, based on this principle, is an effective algorithm for solving multi-armed bandit problems. In a previous study it was defined by a set of heuristic patterns of formulation. The objective of the present paper is to redefine the value functions of the Overtaking method and to unify their formulation. The unified Overtaking method is associated with statistical upper bounds of confidence intervals of expected rewards. The unification of the formulation enhances the universality of the Overtaking method. Consequently, we newly obtain an Overtaking method for exponentially distributed rewards, analyze it numerically, and show that it outperforms the UCB algorithm on average. The present study suggests that, in the context of multi-armed bandit problems, the principle of optimism in the face of uncertainty should be regarded not as a heuristic but as a statistics-based consequence of the law of large numbers for the sample mean of rewards and of the estimation of upper bounds of expected rewards. (C) 2017 Elsevier B.V. All rights reserved.
A multi-armed bandit problem is a search problem in which a learning agent must select the optimal arm among multiple slot machines generating random rewards. The UCB algorithm is one of the most popular methods for solving multi-armed bandit problems; it achieves logarithmic regret by balancing exploration and exploitation. Since the introduction of UCB algorithms, researchers have known empirically that optimistic value functions perform well in multi-armed bandit problems. The terms optimistic or optimism suggest that the value function is sufficiently larger than the sample mean of rewards. The original definition of the UCB algorithm focuses on the optimization of regret and is not directly based on the optimism of a value function, so we need to consider why optimism yields good performance in multi-armed bandit problems. In the present article, we propose a new method, called the Overtaking method, to solve multi-armed bandit problems. Its value function is defined as the upper bound of a confidence interval for an estimator of the expected reward: the value function asymptotically approaches the expected reward from above. If the value function exceeds the expected value under this asymptote, then the learning agent is almost sure to obtain the optimal arm. This structure is called the sand-sifter mechanism; it exhibits no regrowth of the value functions of suboptimal arms, meaning the learning agent plays only the current best arm at each time step. Consequently, the proposed method achieves a high accuracy rate and low regret, and some of its value functions outperform UCB algorithms. This study demonstrates the advantage of agent optimism in uncertain environments using one of the simplest frameworks. (C) 2015 The Authors. Published by Elsevier Ireland Ltd.
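For reference, the baseline that the Overtaking method is compared against can be sketched as standard UCB1: play each arm once, then always pull the arm maximizing sample mean plus exploration bonus. The function signature and the `pull` callback are assumptions for illustration.

```python
import math

def ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` steps; `pull(arm)` returns a reward.
    Returns how often each arm was played."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: play each arm once
        else:
            arm = max(range(n_arms),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return counts
```

On a toy two-arm instance with deterministic rewards 0.1 and 0.9, the better arm accumulates the vast majority of pulls, while the suboptimal arm's exploration bonus still triggers occasional revisits — the "regrowth" of suboptimal value functions that the sand-sifter mechanism avoids.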
ISBN: (Print) 9781479945528
Many real-world stochastic environments are inherently multi-objective environments with conflicting objectives. The multi-objective multi-armed bandits (MOMAB) are extensions of the classical, i.e. single-objective, multi-armed bandits to reward vectors, and multi-objective optimisation techniques are often required to design mechanisms with an efficient exploration/exploitation trade-off. In this paper, we propose the improved Pareto Upper Confidence Bound (iPUCB) algorithm, which straightforwardly extends the single-objective improved UCB algorithm to reward vectors by deleting the suboptimal arms. The goal of iPUCB is to identify the set of best arms, i.e. the Pareto front, within a fixed budget of arm pulls. We experimentally compare the performance of the proposed algorithm with the Pareto UCB1 algorithm and the Hoeffding race on a bi-objective example coming from an industrial control application, the engagement of wet clutches. We also propose a new regret metric based on the Kullback-Leibler divergence to measure the performance of a multi-objective multi-armed bandit algorithm. We show that iPUCB outperforms the other two tested algorithms in the given multi-objective environment.
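A minimal sketch of the Pareto-front notion used above (Pareto dominance over estimated mean reward vectors; this illustrates the concept, not the iPUCB algorithm itself):

```python
def dominates(u, v):
    """True iff reward vector u Pareto-dominates v: at least as good
    in every objective and strictly better in at least one."""
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))

def pareto_front(mean_vectors):
    """Indices of arms whose estimated mean vectors are non-dominated;
    these form the Pareto front of best arms."""
    return [i for i, u in enumerate(mean_vectors)
            if not any(dominates(v, u)
                       for j, v in enumerate(mean_vectors) if j != i)]
```

An arm-deletion scheme like the one described above would repeatedly drop arms whose upper-confidence vectors become dominated, keeping only Pareto-front candidates.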