We provide the theoretical lower bounds for these settings and the
regret upper bounds of the listed algorithms as a function of the
following key quantities: $K$ - number of arms, $T$ - number of rounds (time steps),
$D$ - dimension (where appropriate), and, where relevant, additional
parameters pertinent to the assumptions on the bandits' structure. We also use the notations
$R_T$, $r_T$, and $g_T$ for the cumulative regret, the simple regret, and the gain at time $T$, respectively.
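For concreteness, writing $I_t$ for the arm pulled at round $t$, $J_T$ for the arm recommended after round $T$, $X_{i,t}$ for the reward of arm $i$ at round $t$, $\mu_i$ for the mean reward of arm $i$, $\mu^*=\max_i \mu_i$, and $\Delta_i=\mu^*-\mu_i$ for the gaps, these quantities take the standard form
$$R_T=\sum_{t=1}^{T}\left(\mu^*-\mu_{I_t}\right),\qquad r_T=\mu^*-\mu_{J_T},\qquad g_T=\sum_{t=1}^{T}X_{I_t,t}.$$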
Small number of arms
Cumulative Regret $E(R_T)$
Stochastic
Lower bound:
Under mild assumptions on the reward distributions, the expected cumulative regret satisfies the asymptotic lower bound
$\liminf_{t\rightarrow\infty}E(R_t)/\log(t)
\geq \sum_{i:\mu_i \ne \mu^*} \frac{\Delta_i}{ KL(P_i||P^*)}$.
ε-greedy |
|
$\liminf_{t\rightarrow\infty}E(R_t)/\log(t)
\geq \sum_{i:\mu_i \ne \mu^*} \frac{\Delta_i}{ KL(P_i||P^*)}$ |
Outperformed by UCB strategies in practice. |
UCB1 |
[Auer et al, 2002] |
$E(R_T)/\log T \le \sum_{i:\mu_i \ne \mu^*}
\frac{8}{\Delta_i} + o(1)$ |
Over-explores and does not exploit the observed reward variance (a minimal index sketch is given below, after the FH-Gittins row). |
UCB-V |
[Audibert et al, 2009] |
$E(R_T)/\log T \le \sum_{i:\mu_i \ne \mu^*}
10\left(\frac{\sigma_i^2}{\Delta_i} + 2b\right)$ |
The gain over UCB1 is small, but it allows for some control of the risk. |
KL-UCB |
[Garivier et al, 2011] |
$E(R_T)/\log T \le (1+\epsilon)\sum_{i:\mu_i \ne \mu^*}
\left(\frac{\Delta_i}{KL(P_i||P^*)}\right)+o(1)$ |
Outperforms UCB1, UCB-V. |
Thompson sampling |
[Agrawal et al, 2011] |
|
Empirical performance is better than that of the UCB strategies
and hints that it might provide better control over risk. |
FH-Gittins |
|
$E(R_T)/\log T \le \sum_{i:\mu_i \ne \mu^*}
\frac{32}{\Delta_i}+o(1)$ |
Designed for the Bayesian setting and requires knowledge
of the time horizon, but performs well
in comparison with the anytime strategies above. |
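As an illustration of the index strategies in this table, here is a minimal sketch of the UCB1 rule of [Auer et al, 2002]: play each arm once, then the arm maximizing the empirical mean plus a confidence term. The function names and the Bernoulli test arms below are purely illustrative, and rewards are assumed to lie in $[0,1]$.

```python
import math
import random

def ucb1(pull, K, T):
    """Minimal UCB1 sketch: play each arm once, then the arm with the
    highest empirical mean plus confidence term. `pull(i)` returns a
    reward in [0, 1] for arm i."""
    counts = [0] * K           # n_i: number of times arm i has been pulled
    sums = [0.0] * K           # cumulative reward collected from arm i
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1          # initialisation: pull every arm once
        else:
            i = max(range(K),
                    key=lambda a: sums[a] / counts[a]
                    + math.sqrt(2.0 * math.log(t) / counts[a]))
        counts[i] += 1
        sums[i] += pull(i)
    return sums, counts

# Usage with Bernoulli arms (the means below are purely illustrative):
means = [0.2, 0.5, 0.7]
sums, counts = ucb1(lambda i: float(random.random() < means[i]), K=3, T=10000)
```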
Adversarial $\Omega(\sqrt{TK})$
EXP3 |
[Auer et al, 2002] |
$O\left(\sqrt{TK \log K}\right)$ |
The standard stationarity assumptions on the arms can be dropped, but at a significant cost in performance. This simple version requires knowledge of $T$ in order to
set its parameters (a sketch of the exponential-weights update follows the EXP3.1 row below). |
EXP3.1 |
[Auer et al, 2002] |
$O\left(\sqrt{G_{\max}K \log K}\right)$ |
A version of the above algorithm that does not require knowing $T$ in advance. Modifications for other scenarios also exist. |
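A sketch of the exponential-weights update behind EXP3, referred to above. The mixing parameter $\gamma$ is left as an illustrative constant rather than the tuned value from [Auer et al, 2002], and rewards are assumed to lie in $[0,1]$.

```python
import math
import random

def exp3(pull, K, T, gamma=0.1):
    """Sketch of EXP3: exponential weights over arms with importance-weighted
    reward estimates. Rewards returned by `pull(i)` are assumed to lie in
    [0, 1]; gamma is an illustrative mixing parameter, not the tuned value."""
    weights = [1.0] * K
    gain = 0.0
    for t in range(T):
        w_sum = sum(weights)
        probs = [(1.0 - gamma) * w / w_sum + gamma / K for w in weights]
        i = random.choices(range(K), weights=probs)[0]   # sample an arm
        x = pull(i)
        gain += x
        x_hat = x / probs[i]                             # unbiased estimate of arm i's reward
        weights[i] *= math.exp(gamma * x_hat / K)        # only the played arm is updated
    return gain
```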
Bound $R_T$ with prob. $1-\delta$
PAC-UCB |
[Auer et al, 2002] |
$R_t\leq C$, where $C>0$ is a constant depending on the distributions of the arms and the parameters of the algorithm.
|
Simple Regret $E(r_T)$
Uniform |
[Bubeck et al, 2009] |
$r_t\leq Ce^{-Dt}$ where $C,D>0$ are distribution
dependent. |
Empirically, three regimes emerge. The Uniform strategy does better in the long run (a round-robin sketch is given below, after the UCB(p) row). |
UCB(p) |
[Bubeck et al, 2009] |
$r_t\leq Ct^{-D}$ where $C,D>0$ are distribution
dependent. |
However, over the mid-term, UCB strategies can be more efficient. Over very short time scales it is difficult to choose between them. |
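The uniform strategy referred to above admits a very short sketch: allocate pulls round-robin and, at the end, recommend the empirically best arm. The empirical-best-arm recommendation rule and the function names are our illustrative choices.

```python
def uniform_then_recommend(pull, K, T):
    """Round-robin allocation; after T pulls, recommend the arm with the
    highest empirical mean (the quantity whose simple regret is bounded).
    Assumes T >= K so that every arm is pulled at least once."""
    counts = [0] * K
    sums = [0.0] * K
    for t in range(T):
        i = t % K                        # uniform (round-robin) allocation
        sums[i] += pull(i)
        counts[i] += 1
    return max(range(K), key=lambda a: sums[a] / counts[a])
```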
Expected Discounted Reward
Gittins |
[Gittins, 1979] |
Optimal |
Achieves the maximal expected discounted reward and is robust to many types of stochastic arms. Implementation is often difficult, so its complexity is hard to specify. The strategy is not consistent. |
DP Solution |
|
Optimal |
Always possible, but computationally very inefficient (a small backward-induction sketch is given below, after the FH-Gittins row). |
FH-Gittins |
[Nino-Mora, 2010] |
Suboptimal |
Computationally efficient in comparison with DP, but not
optimal. |
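A small sketch of the DP solution mentioned above, under the illustrative assumptions of Bernoulli arms, independent Beta(1,1) priors and a fixed discount factor. The number of reachable posterior states grows exponentially with the horizon, which is exactly the inefficiency noted in the table.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def bayes_optimal_value(state, horizon, discount=0.9):
    """Backward-induction (DP) value of a Bernoulli bandit with independent
    Beta(1,1) priors. `state` is a tuple of (successes, failures) per arm.
    The discount factor and horizon are illustrative."""
    if horizon == 0:
        return 0.0
    best = 0.0
    for i, (s, f) in enumerate(state):
        p = (s + 1) / (s + f + 2)                    # posterior mean of arm i
        win = list(state); win[i] = (s + 1, f)       # posterior after a success
        lose = list(state); lose[i] = (s, f + 1)     # posterior after a failure
        value = (p * (1.0 + discount * bayes_optimal_value(tuple(win), horizon - 1, discount))
                 + (1 - p) * discount * bayes_optimal_value(tuple(lose), horizon - 1, discount))
        best = max(best, value)
    return best

# Two fresh Bernoulli arms, looking six pulls ahead:
print(bayes_optimal_value(((0, 0), (0, 0)), horizon=6))
```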
Large set of arms
Unstructured
Lower bound: $\forall \beta > 0, \mu^* \le 1:$ $\Omega(T^{\beta/(1+\beta)})$
UCB-AIR algorithm |
[Wang et al, 2008] |
$\tilde{O}(\sqrt{T})$ if $\beta < 1$ and $\mu^* < 1$
$\tilde{O}(T^{\beta/(1+\beta)})$ if $\beta \ge 1$ or $\mu^* = 1$ |
|
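A rough sketch of the arm-increasing idea behind UCB-AIR: new arms are drawn from the reservoir at a polynomial rate in $t$ and a UCB index is run over the sampled pool. The growth exponent $\beta/(\beta+1)$ and the UCB1-style index below are simplifications of [Wang et al, 2008], whose algorithm uses a variance-aware index and a regime-dependent rate; `draw_new_arm` and `pull` are assumed interfaces.

```python
import math

def ucb_air_sketch(draw_new_arm, pull, T, beta=1.0):
    """Arm-increasing rule for infinitely many arms: grow the pool of sampled
    arms polynomially in t and run a UCB index on the pool. Simplified sketch,
    not the exact UCB-AIR algorithm."""
    arms, counts, sums = [], [], []
    for t in range(1, T + 1):
        if len(arms) < math.ceil(t ** (beta / (beta + 1.0))):
            arms.append(draw_new_arm())        # sample a fresh arm from the reservoir
            counts.append(0)
            sums.append(0.0)
        def index(a):
            if counts[a] == 0:
                return float('inf')
            return sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        i = max(range(len(arms)), key=index)
        sums[i] += pull(arms[i])
        counts[i] += 1
    return arms, counts, sums
```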
Linear: $\Omega(D\sqrt{T})$
Stochastic:
Confidence Ellipsoid |
[Dani et al, 2008] |
$\tilde{O}(D\sqrt{T})$ |
Complexity $O(2^D)$ in general, but efficient for some special cases (an index-based sketch of this stochastic case is given below, after the Geometric Hedge row). |
Adversarial:
Self-conc. barrier |
[Abernethy et al, 2008] |
$O(D^{3/2}\sqrt{T}\log{T})$ |
Computationally efficient; the exploration is shaped by a self-concordant barrier of the decision set. |
Geometric Hedge |
[Dani et al, 2008] |
$O(D^{3/2}\sqrt{T})$ |
Efficient algorithm exists if provided access to a certain optimization oracle (such as path planning). |
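For the stochastic linear case above (the Confidence Ellipsoid row), the optimism principle reduces to an index of the form $\langle x,\hat\theta_t\rangle + \beta_t\|x\|_{A_t^{-1}}$. The sketch below uses a finite action set and a fixed confidence radius `beta` in place of the time-dependent radius from [Dani et al, 2008], so it should be read as the LinUCB-style shape of the method rather than the exact algorithm.

```python
import numpy as np

def lin_ucb(actions, pull, T, lam=1.0, beta=1.0):
    """Optimism sketch for a stochastic linear bandit over a finite action
    set `actions` (array of shape n_actions x D). `beta` is a fixed stand-in
    for the time-dependent confidence-ellipsoid radius."""
    D = actions.shape[1]
    A = lam * np.eye(D)                 # regularised Gram matrix
    b = np.zeros(D)
    for t in range(T):
        theta_hat = np.linalg.solve(A, b)                # ridge estimate of theta*
        A_inv = np.linalg.inv(A)
        widths = np.sqrt(np.einsum('ij,jk,ik->i', actions, A_inv, actions))
        scores = actions @ theta_hat + beta * widths     # optimistic index per action
        x = actions[int(np.argmax(scores))]
        r = pull(x)                                      # noisy observation of <x, theta*>
        A += np.outer(x, x)
        b += r * x
    return np.linalg.solve(A, b)
```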
Convex: $\Omega(D\sqrt{T})$ [blog]
Lipschitz
Stochastic:
Convex + Lipschitz
Stochastic: $\Omega(\sqrt{DT})$
Adversarial: $\Omega(D\sqrt{T})$
Convex + Lipschitz + Lipschitz continuous gradient
Saha's algorithm |
[Saha et al, 2011] [video] |
$O(D^{2/3}T^{2/3})$ |
The $O$ hides the dependence on the Lipschitz constant and the diameter of the space. |
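For the convex settings above, the common device is to estimate a gradient from a bandit evaluation of the loss and feed it to gradient descent. The sketch below shows the classical one-point estimator on a Euclidean ball, which conveys the flavour but is not the barrier-based method of [Saha et al, 2011]; the step size `eta`, smoothing radius `delta` and ball radius are illustrative.

```python
import numpy as np

def one_point_bco(loss, D, T, eta=0.01, delta=0.1, radius=1.0):
    """Bandit convex optimisation with a one-point gradient estimate: probe
    the loss at a randomly perturbed point, build an estimate of the gradient
    of a smoothed loss, and take a projected gradient step on a Euclidean
    ball. Illustrative sketch only."""
    x = np.zeros(D)
    for t in range(T):
        u = np.random.randn(D)
        u /= np.linalg.norm(u)                 # uniform direction on the sphere
        y = x + delta * u                      # the point actually played
        g_hat = (D / delta) * loss(y) * u      # one-point gradient estimate
        x = x - eta * g_hat
        norm = np.linalg.norm(x)
        if norm > radius - delta:              # keep the next probe inside the ball
            x *= (radius - delta) / norm
    return x
```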
Submodular