The Bandit Page

Bandit Learning

We provide the theoretical lower bounds for these settings and the regret upper bounds of the listed algorithms as functions of the following key quantities: $K$ - number of arms, $T$ - number of rounds (time steps), $d$ - dimension (where appropriate), and sometimes also additional parameters pertinent to the assumptions on the bandits' structure. We shall also use the notation $R_T$, $r_T$, and $g_T$ for the cumulative regret, the simple regret, and the gain at time $T$, respectively.

Small number of arms

Cumulative Regret $E(R_T)$


Lower bound: Under mild assumptions on the reward distributions, the asymptotic lower bound on the expected cumulative regret is $\liminf_{t\rightarrow\infty}E(R_t)/\log(t) \geq \sum_{i:\mu_i \ne \mu^*} \frac{\Delta_i}{ KL(P_i||P^*)}$.
ε-greedy [Auer et al, 2002] With an exploration rate decreasing appropriately as $\epsilon_t \propto 1/t$, achieves $E(R_T) = O(\log T)$; with constant $\epsilon$ the regret grows linearly. Outperformed by UCB strategies in practice.
UCB1 [Auer et al, 2002] $E(R_T)/\log T \le \sum_{i:\mu_i \ne \mu^*} \frac{8}{\Delta_i} + o(1)$ Over-explores. Does not exploit observed reward variance.
UCB-V [Audibert et al, 2009] $E(R_T)/\log T \le 10\sum_{i:\mu_i \ne \mu^*} \left(\frac{\sigma_i^2}{\Delta_i} + 2b\right)$ The gain over UCB1 is small, but it allows for some control of the risk.
KL-UCB [Garivier et al, 2011] $E(R_T)/\log T \le (1+\epsilon)\sum_{i:\mu_i \ne \mu^*} \left(\frac{\Delta_i}{KL(P_i||P^*)}\right)+o(1)$ Matches the lower bound up to the $(1+\epsilon)$ factor. Outperforms UCB1 and UCB-V.
Thompson sampling [Agrawal et al, 2011] Empirical performance is better than that of the UCB strategies and hints that it may provide better control over risk.
FH-Gittins $E(R_T)/\log T \le \sum_{i:\mu_i \ne \mu^*} 32\left(\frac{1}{\Delta_i}\right)+o(1)$ Designed for the Bayesian setting and requires knowledge of the time horizon, but performs well compared with the anytime strategies above.
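To make the index strategies above concrete, here is a minimal sketch of UCB1 in Python. The Bernoulli arms, the `pull` callback, and all parameter names are our illustrative choices, not part of any of the cited papers:

```python
import math
import random

def ucb1(pull, n_arms, horizon, seed=0):
    """UCB1: play each arm once, then always play the arm maximizing
    the empirical mean plus the exploration bonus sqrt(2 log t / n_i)."""
    rng = random.Random(seed)
    counts = [0] * n_arms          # n_i: number of pulls of arm i
    sums = [0.0] * n_arms          # running reward sum of arm i
    for t in range(horizon):
        if t < n_arms:
            arm = t                # initialization: play every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda i: sums[i] / counts[i]
                                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        counts[arm] += 1
        sums[arm] += pull(arm, rng)
    return counts

# Two Bernoulli arms with means 0.3 and 0.7; UCB1 should concentrate
# its pulls on arm 1 as t grows.
means = [0.3, 0.7]
counts = ucb1(lambda i, rng: 1.0 if rng.random() < means[i] else 0.0,
              n_arms=2, horizon=2000)
```

Note how the suboptimal arm is still pulled $\Theta(\log T)$ times, reflecting the $\sum_i 8/\Delta_i$ term in the bound above.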

Adversarial $\Omega(\sqrt{TK})$

EXP3 [Auer et al, 2002] $O\left(\sqrt{g_T K \log K}\right)$ One can drop the standard stationarity assumptions on the arms, but at a significant cost in performance. This simple algorithm requires knowledge of $T$ in order to set its parameters.
EXP3.1 [Auer et al, 2002] $O\left(\sqrt{G_{\max}K \log K}\right)$ A version of the above algorithm which does not require knowledge of $T$ in advance. Other modifications exist for other scenarios.
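A minimal sketch of EXP3, again with names and parameters of our choosing; the stochastic arms in the example are only a sanity check, since EXP3 makes no stationarity assumption:

```python
import math
import random

def exp3(pull, n_arms, horizon, gamma=0.1, seed=0):
    """EXP3: exponential weights over arms, mixed with uniform exploration.
    Rewards are assumed in [0, 1] and are importance-weighted by the
    probability of the arm actually played."""
    rng = random.Random(seed)
    w = [1.0] * n_arms
    total = 0.0
    for _ in range(horizon):
        s = sum(w)
        probs = [(1.0 - gamma) * wi / s + gamma / n_arms for wi in w]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        r = pull(arm, rng)
        total += r
        # unbiased estimate r / probs[arm] for the played arm, 0 elsewhere
        w[arm] *= math.exp(gamma * r / (probs[arm] * n_arms))
        s = sum(w)
        w = [wi / s for wi in w]   # renormalize to avoid overflow
    return total

# Bernoulli arms with means 0.8 and 0.2; the gain should approach that
# of always playing arm 0, minus the forced-exploration cost.
gain = exp3(lambda i, rng: 1.0 if rng.random() < (0.8, 0.2)[i] else 0.0,
            n_arms=2, horizon=3000)
```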

Bound $R_T$ with prob. $1-\delta$

PAC-UCB [Auer et al, 2002] $R_T\leq C$ with probability $1-\delta$, where $C>0$ is a constant depending on the distributions of the arms and the parameters of the algorithm.

Simple Regret $E(r_T)$

Uniform [Bubeck et al, 2009] $r_t\leq Ce^{-Dt}$, where $C,D>0$ are distribution dependent. Empirically, three regimes emerge; the uniform strategy does better in the long run.
UCB(p) [Bubeck et al, 2009] $r_t\leq Ct^{-D}$, where $C,D>0$ are distribution dependent. Over the mid-term, however, UCB strategies can be more efficient; over very short time scales it is difficult to choose between them.
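A sketch of the uniform allocation strategy for the pure-exploration setting; the three Bernoulli arms and the budget are our illustrative choices:

```python
import random

def uniform_allocation(pull, n_arms, budget, seed=0):
    """Pure-exploration baseline: spend the sampling budget evenly over
    the arms, then recommend the arm with the best empirical mean."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(budget):
        arm = t % n_arms               # round-robin over the arms
        counts[arm] += 1
        sums[arm] += pull(arm, rng)
    return max(range(n_arms), key=lambda i: sums[i] / counts[i])

# Three Bernoulli arms. Only the final recommendation matters (simple
# regret), not the rewards collected along the way.
means = [0.2, 0.6, 0.4]
best = uniform_allocation(lambda i, rng: 1.0 if rng.random() < means[i] else 0.0,
                          n_arms=3, budget=3000)
```

This is the strategy whose simple regret decays exponentially in the budget, as in the bound above.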

Expected Discounted Reward

Gittins [Gittins et al, 1979] Optimal. Achieves the maximum and is robust to many types of stochastic arms. Implementation is often difficult, so its complexity is hard to specify. The strategy is not consistent.
DP Solution Optimal Always possible, but computationally very inefficient.
FH-Gittins [Nino-Mora, 2010] Suboptimal Computationally efficient in comparison with DP, but not optimal.

Large set of arms


Lower bound: $\forall \beta > 0, \mu^* \le 1:$ $\Omega(T^{\beta/(1+\beta)})$, where $\beta$ characterizes how likely a randomly drawn arm is to be near-optimal ($P(\mu \ge \mu^* - \epsilon) = \Theta(\epsilon^\beta)$).
UCB-AIR algorithm [Wang et al, 2008] $\tilde{O}(\sqrt{T})$ if $\beta < 1$ and $\mu^* < 1$
$\tilde{O}(T^{\beta/(1+\beta)})$ if $\beta \ge 1$ or $\mu^* = 1$

Linear: $\Omega(D\sqrt{T})$


Confidence Ellipsoid [Dani et al, 2008] $\tilde{O}(D\sqrt{T})$ Complexity is $O(2^D)$ in general, but efficient for some special cases.


Self-conc. barrier [Abernethy et al, 2008] $O(D^{3/2}\sqrt{T}\log{T})$ Computationally efficient; regularizes with a self-concordant barrier for the decision set.
Geometric Hedge [Dani et al, 2008] $O(D^{3/2}\sqrt{T})$ An efficient algorithm exists given access to a certain optimization oracle (such as a path-planning oracle).

Convex $\Omega(D\sqrt{T})$

Bandit Gradient Descent [Flaxman et al, 2005] $O(\sqrt{D}T^{5/6})$
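The key idea behind Bandit Gradient Descent is the one-point gradient estimate: querying the loss at $x + \delta u$ for a random unit vector $u$ gives $(D/\delta)\,f(x+\delta u)\,u$, an unbiased estimate of the gradient of a smoothed version of $f$. A minimal sketch, with the quadratic loss, step sizes, and projection radius chosen by us for illustration:

```python
import math
import random

def bandit_gradient_descent(loss, dim, horizon, radius=2.0,
                            delta=0.25, eta=0.005, seed=0):
    """Bandit Gradient Descent (Flaxman et al., 2005): only the loss value
    at the queried point is observed, never the gradient itself."""
    rng = random.Random(seed)
    x = [0.0] * dim
    for _ in range(horizon):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(ui * ui for ui in u))
        u = [ui / norm for ui in u]                 # uniform on the unit sphere
        val = loss([xi + delta * ui for xi, ui in zip(x, u)])
        # descend along the one-point gradient estimate (dim/delta) * val * u
        x = [xi - eta * (dim / delta) * val * ui for xi, ui in zip(x, u)]
        nx = math.sqrt(sum(xi * xi for xi in x))
        if nx > radius:                             # project back onto the ball
            x = [xi * radius / nx for xi in x]
    return x

# Minimize ||y - c||^2 over the ball of radius 2, from loss values alone.
c = (1.0, -1.0)
x = bandit_gradient_descent(lambda y: sum((yi - ci) ** 2
                                          for yi, ci in zip(y, c)),
                            dim=2, horizon=50000)
```

The projection step keeps both the iterates and the variance of the estimate bounded; without it the algorithm can diverge.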



$\mathcal{X}$-armed bandits [Bubeck et al, 2011] $\tilde{O}(T^\frac{d+1}{d+2})$, where $d$ is the near-optimality dimension of the mean-payoff function.

Convex + Lipschitz

Stochastic: $\Omega(\sqrt{DT})$

Agarwal's algo [Agarwal et al, 2011] $\tilde O(D^{16}\sqrt{T})$

Adversarial: $\Omega(D\sqrt{T})$

Bandit Gradient Descent [Flaxman et al, 2005] $O(\sqrt{D}T^{3/4})$

Convex + Lipschitz + Lipschitz continuous gradient

Saha's algo [Saha et al, 2011] [video] $O(D^{2/3}T^{2/3})$ Here $O$ hides the dependence on the Lipschitz constant and the diameter of the space.


Hazan's algo [Hazan et al, 2009] $O(DT^{2/3})$