## Notes of Sebastien Bubeck’s course 13-03-2018

Bandit regret optimization

Blackboard course. Slides available on the PGMO website. Video available in a few weeks.

1. Introduction to regret analysis

1.1. Framework

At each time step ${t}$, the player selects an action ${i_t\in[n]}$ at random, according to a distribution ${p_t\in\Delta_n}$ which is independent of the past. Simultaneously, the adversary selects a loss ${l_t:[n]\rightarrow[0,1]}$. The loss suffered by the player is ${l_t(i_t)}$. Full information would mean that ${l_t\in {\mathbb R}^n}$ is observed. Bandit means that only ${l_t(i_t)}$ is observed.

The benchmark, to which other strategies will be compared, is

$\displaystyle \begin{array}{rcl} i^*_t=\mathrm{arg}\min_{i\in[n]} \mathop{\mathbb E}(\sum_{i=1}^n l_t(i)). \end{array}$

My loss ${L_T}$ must be compared with the optimal loss ${L^*_T}$,

$\displaystyle \begin{array}{rcl} L_T=\mathop{\mathbb E}(\sum_{t=1}^T l_t(i_t)),\quad L^*_T=\mathop{\mathbb E}(\sum_{t=1}^T l_t(i^*_t)). \end{array}$

The regret is the difference ${R_T=L_T-L^*_T}$.

The machine learning mindset is that a single picture does not help much, but a large dataset of actions allows for a good strategy. So ${[n]}$ is a large set, eventually with some structure.

Think of an oblivious adversary, like nature. Adversary merely means that we cannot make any stochastic assumption about it. For instance ${i^*}$ does not represent what a clever adversary would have done when player applies a constant strategy.

Interpret ${R_T/T}$ as the rate at which ${i^*}$ is identified. How large should one take ${T}$ in order that ${R_T/T<\epsilon}$?

If there is a good action, we intend to find it. If there is none,…

1.2. Some applications

Add placement. The player runs a website. Action = choose an add. Adversary is a potential customer, who clicks on the add or not. Best

Packet routing. Graph with source and target nodes. Action = choose a path. Huge set of actions, but with a lot of structure (distance between paths).

Hyperparameter optimization.

AI for games, i.e. efficient tree search. In AlphaGo, a determinist game, randomness was introduced to improve exploration of the set of actions.

1.3. Overview of results

\textbf{Full information with finite actions} There,

1. ${R_T\geq\Omega(\sqrt{T\log n})}$ for pure noise (case when the past gives no information).
2. There exists an algorithm which achieves ${R_T\leq O(\sqrt{T\log n})}$.

I will give two proofs, an algorithmic one, and a more conceptual one where ${\log n}$ appears as an entropy.

\textbf{Easy data} I.e. when there is more signal in the data than pure noise. Then ${R_T\leq O(\sqrt{L^*_T\log n})}$.

\textbf{Bandit with finite action} Then ${R_T=\Theta(\sqrt{Tn})}$. Bad news: in absence of structure, strategy is slow.

\textbf{Large and structured action set} This uses deeper mathematics (Bernoulli convolution, geometry of log-concave measures).

\textbf{Better benchmarks}

2. Full information

2.1. An algorithm: multiplicative weights

Rediscovered many times. See Vovk 1990, Freund-Shapira 1996, Littlestone-Warmuth 1994.

Assume oblivious adversary, and ${l_{i,t}:=l_t(i)\in\{ 0,1 \}}$.

Idea: keep a weight ${w_{i,t}\in{\mathbb R}_+}$ for each action ${i}$. Update

$\displaystyle \begin{array}{rcl} w_{i,t+1}=(1-\eta l_{i,t})w_{i,t}. \end{array}$

Play from ${p_t=\frac{w_t}{\|w_t\|_1}}$.

Key intuition: Say ${i^*}$ is correct at round ${t}$, then ${w_{i^*,t}}$ remains and others shrink, so ${p_{t+1}}$ gets closer to ${\delta_{i^*}}$. Otherwise, ${i^*}$ pays for it a fixed factor ${1-\eta}$.

Theorem 1

$\displaystyle \begin{array}{rcl} L_T< (1+\eta)L^*_T+\frac{\log\eta}{\eta}. \end{array}$

Optimizing in ${\eta}$ yields

$\displaystyle \begin{array}{rcl} R_T\leq 2\sqrt{L^*_T \log n}. \end{array}$

Proof Liapunov function based argument. Consider ${u_t=\|w_t\|_1}$. Then

$\displaystyle \begin{array}{rcl} u_{t+1}&=&\sum_i (1-\eta l_{i,t})w_{i,t}\\ &=&u_t(1-\eta\langle p_t,l_t\rangle)\\ &\leq&u_t e^{-\eta\langle p_t,l_t\rangle}, \end{array}$

hence

$\displaystyle \begin{array}{rcl} u_t\leq u_0 e^{-\eta L_t}, \end{array}$

since ${L_t=\mathop{\mathbb E}(\sum_t\langle p_t,l_t\rangle)}$. Note that ${u_0\leq n}$. On the other hand,

$\displaystyle \begin{array}{rcl} w_{i,T+1}=(1-\eta)^{L_{i,T}}, \end{array}$

hence

$\displaystyle \begin{array}{rcl} (1-\eta)^{L^*_T}\leq n e^{-\eta L_T}, \end{array}$

$\displaystyle \begin{array}{rcl} L_T\leq (1+\eta)L^*_T +\frac{\log\eta}{\eta}, \end{array}$

as announced. End of proof.

2.2. Lower bound (pure noise scenario)

Theorem 2 For every algorithm, there exists numbers ${l_1,\ldots,l_T\in[-T,T]^n}$ such that

$\displaystyle \begin{array}{rcl} R_T\geq \sqrt{2T\log n}(1+o(1)). \end{array}$

Proof. Probabilistic method. A lower bound on expected regret (with random losses) is given.

Pick ${l_{i,t}}$ at random from a Rademacher distribution (${\pm 1}$ uniformaly independently). Then ${\mathop{\mathbb E}(L_T)=0}$,

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}(L^*_T)=\mathop{\mathbb E}\max_{i\in[n]}\sum_t l_{i,t}\sim \sqrt{2T\log n} \end{array}$

according to the central limit theorem. Indeed, the max of ${T}$ independent gaussians has size ${\sqrt{T}}$. End of proof.

2.3. A more principled approach

Based on Abernethy-Warmuth 2008, Bubeck-Dehel-Koren-Peres 2015.

Say players picks ${a_t\in\mathcal{A}}$ and adversary picks ${l_1,\ldots,l_T\in\mathcal{L}^T}$.

A deterministic strategy is given by a map ${\underline{a_t}:\mathcal{L}^{t-1}\rightarrow\mathcal{A}}$.

A bandit is a map ${\underline{a_t}:[0,1]^{t-1}\rightarrow\mathcal{A}}$.

A randomized strategy is picking at random a deterministic strategy ${\Delta(\underline{\mathcal{A}})}$, where ${\underline{\mathcal{A}}}$ is the set of deterministic strategies. Kuhn’s theorem

The minimax regret consists in finding the best strategy, achieving

$\displaystyle \begin{array}{rcl} \inf_{\mu\in \Delta(\underline{\mathcal{A}})}\sup_{\underline{l}\in\mathcal{L}^T}\mathop{\mathbb E}_{\underline{a}\sim\mu}(R_T(\underline{a},\underline{l})) =\inf_\mu \sup_{\nu\in\Delta(\mathcal{L}^T)}\mathop{\mathbb E}_{\mu,\nu}(R_T(\underline{a},\underline{l})). \end{array}$

Sion’s minimax theorem allows to swap inf and sup: let ${\phi:X\times Y\rightarrow{\mathbb R}}$ be convex in ${x}$ and concave in ${y}$, bicontinuous, ${X}$ compact. Then

$\displaystyle \begin{array}{rcl} \inf_X \sup_Y \phi(x,y)=\sup_Y\inf_X \phi(x,y). \end{array}$

Now we are in Bayesian setting, with ${\nu}$ a prior distribution on ${(l_1,\ldots,l_T)}$. In particular, we know the distribution of ${i^*}$.

Denote by ${\mathop{\mathbb E}_t:=\mathop{\mathbb E}[\cdot|l_1,\ldots,l_{t-1}]}$. By convexity of the simplex, we choose to play ${p_t:=\mathop{\mathbb E}_t \delta_{i^*}}$, a Doob martingale. In other words, ${p_t(i)=\mathop{\mathbb P}(i^*=i|l_1,\ldots,l_{t-1})}$. Let us analyze it.

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}(\sum_t \langle p_t-\delta_{i^*},l_t \rangle) &=&\mathop{\mathbb E}(\sum_t \langle p_t-p_{t+1},l_t \rangle)\\ &\leq&\mathop{\mathbb E}(\sum_t\|p_t-p_{t+1}\|_1). \end{array}$

Each move of a martingale costs entropy. This is encoded in the following folklore inequality,

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}(\sum_t\|p_t-p_{t+1}\|_1^2)\leq 2\log n. \end{array}$

Assuming this bound, use Cauchy-Schwarz,

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}(\sum_t\|p_t-p_{t+1}\|_1)\leq \sqrt{T\mathop{\mathbb E}(\sum_t\|p_t-p_{t+1}\|_1^2)}\leq \sqrt{2T\log n}, \end{array}$

which is the optimal bound. End of proof.

The bandit case will be more subtle, since one will not be allowed to condition with respect to ${l_1,\ldots,l_T}$.

2.4. Proof of the folklore inequality

Recall the entropy of a finite distribution ${H(p)=\sum_i -p_i \log p_i}$. The relative entropy is

$\displaystyle \begin{array}{rcl} Re(p,q)=-\sum_i p_i \log q_i -H(p). \end{array}$

Given finite random variables ${X}$ and ${Y}$, the entropy of ${X}$ knowing ${Y}$ is

$\displaystyle \begin{array}{rcl} H(X|Y)=\sum_y p_y(y)H(p_{x|y}). \end{array}$

The mutual information means how useful is ${Y}$ when predicting ${X}$,

$\displaystyle \begin{array}{rcl} I(X,Y)=H(X)-H(X|Y)=\sum_y p_y(y) Re(p_{x|y},p_x) \end{array}$

It is symmetric. Pinsker ‘s inequality states that

$\displaystyle \begin{array}{rcl} \frac{1}{2}\|p-q\|_1\leq \sqrt{Re(p,q)}. \end{array}$

Proof of the folklore inequality. Apply Pinsker’s inequality,

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}(\frac{1}{2}\sum_t \|p_t-p_{t+1}\|_1^2)\leq \mathop{\mathbb E}(\sum_t Re(p_{t+1},p_t)). \end{array}$

Since

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}_t(Re(p_{t+1},p_t))&=&\mathop{\mathbb E}_t(Re(p^t_{i^*|l_t},p_t))\\ &=&I_t(i^*,l_t)\\ &=&H_t(i^*)-H_t(i^*|l_t)\\ &=&H_t(i^*)-\mathop{\mathbb E}_t(H_{t+1}(t^*)). \end{array}$

Hence

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}(\frac{1}{2}\sum_t \|p_t-p_{t+1}\|_1^2)\leq H_1(i^*). \end{array}$

3. Bandit analysis

Based on Russo-Van Roy 2014.

Back to the instantaneous regret ${r_t=\mathop{\mathbb E}_t\langle p_t-\delta_{i^*},l_t\rangle}$. We used

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}(\sum_t r_t)=\mathop{\mathbb E}(\sum_t\langle p_t-p_{t+1},l_t\rangle. \end{array}$

This is not true any more in the bandit setting, where the filtration is much smaller. Nevertheless, we hope that a weaker inequality would suffice,

$\displaystyle \begin{array}{rcl} r_t\leq \sqrt{I_t(i^*,(i_t,l_t(i_t)))}. \end{array}$

The next argument goes back to the very first paper on bandit theory, Thompson Sampling 1933. Bandit is a trade off between exploration and exploitation. For future benefit, it is necessary to explore more.

Denote ${\bar l_t(i)=\mathop{\mathbb E}_t(l_t(i))}$, ${\bar l_t(i,j)=\mathop{\mathbb E}_t(l_t(i)|i^*=j)}$. Then

$\displaystyle \begin{array}{rcl} r_t=\mathop{\mathbb E}_t\langle p_t-\delta_{i^*},l_t\rangle &=&\sum_i p_t(i)(\bar l_t(i)-\bar l_t(i,i)). \end{array}$

On the other hand,

$\displaystyle \begin{array}{rcl} I_t(i^*,(i_t,l_t(i_t))) &=&H_t(i_t,l_t(i_t))-H_t((i_t,l_t(i_t))|i^*)\\ &\geq& H_t(l_t(i_t)|i_t)-H_t(l_t(i_t)|i^*,i_t)\\ &=&\sum_i p_t(i) I_t(l_t(i)|i^*)\\ &=&\sum_i p_t(i) Re(l_t(i)|i^*,l_t(i))\\ &=&\sum_{i,j} p_t(i)p_t(j) Re(l_t(i)|i^*=j,l_t(i))\\ &\geq&\sum_{i,j} p_t(i)p_t(j) (\bar l_t(i)-\bar l_t(i,j))^2, \end{array}$

where the last step is Pinsker’s inequality, plus the fact that ${l_t(i)\in[-1,1]}$. Cauchy-Schwarz inequality gives

$\displaystyle \begin{array}{rcl} r_t\leq \sqrt{n I_t(i^*,(i_t,l_t(i_t)))}=\sqrt{n}\sqrt{H_t(i^*)-H_{t+1}(i^*)}. \end{array}$

Plugging this in the argument of previous section, one gets

Theorem 3 In the bandit setting,

$\displaystyle \begin{array}{rcl} \inf \sup R_T\leq 2\sqrt{Tn\log n}. \end{array}$

The ${\log n}$ term can be removed.

On the lower bound side, the adversary is picked at random: one is ${Ber(\frac{1}{2}-\epsilon)}$, the others being ${Ber(\frac{1}{2})}$. One needs ${4/\epsilon^2}$ samples per arm, whence ${\frac{1}{\epsilon^2}=\frac{T}{n}}$, regret is ${\sim \epsilon T\sim \sqrt{nT}}$.

4. A primer on bandit linear optimization

Now ${\mathcal{A}\subset {\mathbb R}^n}$, ${l_t\in{\mathbb R}^n}$, and the loss of playing ${a_{i_t}}$ is the inner product ${a_{i_t}\cdot l_t}$. Naively, one would assume that regret is ${\sim\sqrt{T|\mathcal{A}|}}$.

Theorem 4

$\displaystyle \begin{array}{rcl} \inf \sup\mathop{\mathbb E} R_T \leq \sqrt{Tn\log|\mathcal{A}|}. \end{array}$

This is best possible.

If ${\mathcal{A}}$ is infinite but included in the unit ball of ${{\mathbb R}^n}$ for some norm. Then

Theorem 5

$\displaystyle \begin{array}{rcl} \inf \sup\mathop{\mathbb E} R_T =\Theta(T^{1-1/q}), \end{array}$

where ${q}$ is the cotype of the chosen norm on ${{\mathbb R}^n}$.

4.1. Proof of theorem

The last application of Cauchy-Schwarz needs be replaced by something smarter. Here

$\displaystyle \begin{array}{rcl} \bar l_t(i)=\mathop{\mathbb E}_t(l_t(i))=\bar l_t\cdot a_i,\quad \bar l_t(i,j)=\bar l_t^j\cdot a_i. \end{array}$

Introduce the (huge) ${|\mathcal{A}|\times |\mathcal{A}|}$ matrix

$\displaystyle \begin{array}{rcl} M_{i,j}=\sqrt{p_t(i)p_t(j)}(\bar l_t-\bar l_t^j)\cdot a_i . \end{array}$

One needs relate the trace of this matrix to its Frobenius norm. In general,

$\displaystyle \begin{array}{rcl} trace(M)=\sum \lambda_i(M)\leq\sqrt{rank(M)(\sum_i \lambda_i^2(M)}=\sqrt{rank(M)}\|M\|_F. \end{array}$

Thus we win by replacing the size ${|\mathcal{A}|}$ with the rank of matrix ${M}$. Here ${M=AB}$ where ${A}$ has size ${n}$, hence ${M}$ has rank ${\leq n}$.

5. Mirror descent and online decision making

We focus on online linear optimization.

Data: a compact body ${K\subset{\mathbb R}^n}$. At each time step ${t\in[T]}$, the player picks ${x_t\in K}$ and the adversary picks ${l_t\in{\mathbb R}^n}$ (previously, ${K}$ was the simplex) such that ${|l_t\cdot x|\leq 1}$ for all ${x\in K}$. My regret is

$\displaystyle \begin{array}{rcl} R_T=\max_{x\in K} \sum_t l_t\cdot(x_t -x). \end{array}$

5.1. Stability as a algorithm design principle

We estimate

$\displaystyle \begin{array}{rcl} R_T\leq \sum_t\|x_t-x_{t+1}\|+\sum_t l_t\cdot(x_{t+1}-x) \end{array}$

The first term is a movement cost. Call the second term the one look ahead regret. The algorithm will find a trade-off between the two terms.

This is ${x_{t+1}=x_t-\eta l_t}$. In a more invariant manner,

$\displaystyle \begin{array}{rcl} x_{t+1}=\mathrm{arg}\min_{x\in{\mathbb R}^n} l_t\cdot x+\frac{1}{2\eta}\|x_t -x\|_2^2. \end{array}$

Note the ${\ell^2}$ norm is not well suited to our problem, which involves an ${\ell^2}$ norm. Assuming that ${x_1=0}$, a third interpretation, in the spirit of regularization, is

$\displaystyle \begin{array}{rcl} x_{t+1}=\mathrm{arg}\min_{x\in{\mathbb R}^n} \sum_{s=1}^t l_s\cdot x+\frac{1}{2\eta}\|x\|_2^2. \end{array}$

We try to avoid overfit, i.e. control entropy. Hence we replace ${\ell^2}$ norm with entropy. Also, instead of inner product, duality should be used.

5.3. Mirror descent

Initially due to Nemirovskii, see book, by Nemirovskii and Yudin, 1983.

Le ${\phi:\mathcal{D}\supset K\rightarrow{\mathbb R}}$ be a convex function. Write

$\displaystyle \begin{array}{rcl} D_\phi(x,y)=\phi(x)-\phi(y)-\nabla\phi(y)\cdot(x-y). \end{array}$

Introduce the Fenchel dual function on the dual space,

$\displaystyle \begin{array}{rcl} \phi^*(\theta)=\sup_{x\in{\mathbb R}^n}\theta(x)-\phi(x). \end{array}$

Under mild assumptions, ${\phi^{**}=\phi}$.

Given ${x_t\in K}$, ${\nabla\phi(x_t)}$ belongs to the dual space. Move it in the direction of ${l_t}$, get ${\nabla\phi(x_t)-\eta l_t}$. Get back to the primal space by applying ${\nabla\phi^*}$, get

$\displaystyle \begin{array}{rcl} z_{t+1}:=\nabla\phi^*(\nabla\phi(x_t)-\eta l_t). \end{array}$

Maybe this does not belong to ${K}$. Thus project it to ${K}$ (in Euclidean distance). In other words,

$\displaystyle \begin{array}{rcl} x_{t+1}=\mathrm{arg}\min_{x\in K}D_\phi(x,\nabla\phi^*(\nabla\phi(x_t)-\eta l_t). \end{array}$

5.4. Analysis of the one look ahead term

For intuition, I explain the continuous setting. Then ${t\in {\mathbb R}}$, ${x_t}$ is defined using ${l_s}$ for all ${s\in[0,t]}$. Hence the one look ahead term becomes

$\displaystyle \begin{array}{rcl} \int_0^T l_t \cdot(x(t)-t)\,dt \end{array}$

whereas the movement cost becomes ${\int_0^T \|x'(t)\|\,dt}$. The continuous time mirror descent is

$\displaystyle \begin{array}{rcl} x_{t+\epsilon}=\mathrm{arg}\min_{x\in K} D_\phi(x,\nabla\phi^*(\nabla\phi(x(t))-\epsilon\eta l(t)). \end{array}$

The constrained optimality condition at a point ${x\in\partial K}$ involves covectors ${\theta}$ in the normal cone

$\displaystyle \begin{array}{rcl} N_K(x)=\{\theta\,;\,\forall y\in K,\,\theta\cdot (y-x)\leq 0\}. \end{array}$

It states that ${-\nabla f(x^*)\in N_K(x^*)}$. The gradient of ${x\mapsto D_\phi(x,\nabla\phi^*(\nabla\phi(x(t))-\epsilon\eta l(t))}$ is ${\nabla\phi(x)-\nabla\phi(x(t))+\epsilon\eta l(t)}$. Hence the minimizer ${x_{t+\epsilon}}$ satisfies

$\displaystyle \begin{array}{rcl} \nabla\phi(x_{t+\epsilon})-\nabla\phi(x(t))+\epsilon\eta l(t)\in -N_K(x_{t+\epsilon}), \end{array}$

hence the differential inclusion

$\displaystyle \begin{array}{rcl} \nabla^2 \phi(x(t))x'(t)+\eta l(t)\in N_K(x(t)). \end{array}$

Theorem 6 (Bubeck-Cohen-Lee…2017) If ${l}$ is continuous, ${\phi}$ is semicontinuous and ${\nabla^2\phi}$ is continuous, the differential inclusion has a unique solution.

As a Liapunov function, we use

$\displaystyle \begin{array}{rcl} D_\phi(y,x(t))=\phi(y)-\phi(x(t))-\nabla\phi(x(t))\cdot(y-x(t)). \end{array}$

Let us compute the time derivative

$\displaystyle \begin{array}{rcl} \partial_t D_\phi(y,x(t))&=&-\nabla\phi(x(t))\cdot x'(t)-\nabla\phi(x(t))\cdot(-x'(t))-\nabla^2\phi(x(t))x'(t)\cdot(y-x(t))\\ &=&(\eta l(t)+\lambda(t))\cdot(y-x(t))\\ &\leq&\eta l(t)\cdot(y-x(t)) \end{array}$

since the Lagrange multiplier ${\lambda(t)}$ is nonpositive on the normal cone. Integrating over time, we get

Lemma 7

$\displaystyle \begin{array}{rcl} \int l(t)\cdot(x(t)-y)\,dt\leq \frac{1}{\eta}D_\phi(y,x(t)). \end{array}$

Recall the multiplicative weights bound

$\displaystyle \begin{array}{rcl} R_T\leq \frac{\log\eta}{\eta}+\eta L^*_T. \end{array}$

We have obtained the estimate corresponding to ${\frac{\log\eta}{\eta}}$. Next, me move to the other term.

5.5. Movement analysis

Nothing general, here ${K=\Delta_n}$ is the simplex equipped with the ${\ell^1}$ norm. We hope for an estimate

$\displaystyle \begin{array}{rcl} \|x't)\|_1\leq\eta l(t)\cdot y. \end{array}$

A stronger estimate would be

$\displaystyle \begin{array}{rcl} \|x't)\|_1\leq\eta l(t)\cdot x(t). \end{array}$

Here, ${x'(t)=\nabla^2\phi(x(t))^{-1}(\eta l(t)+\lambda(t))}$. We may choose a suitable convex function ${\phi}$. To simplify matters, let us ignore ${\lambda(t)}$ and assume ${\phi(x)=\sum_i\phi(x_i)}$. Then

$\displaystyle \begin{array}{rcl} \|\nabla^2\phi(x(t))^{-1}(l(t))\|_1=\sum_i\frac{l_i(t)}{\phi''(x_i)} \end{array}$

which is indeed estimated by ${l(t)\cdot x(t)}$ provided we choose ${\phi(x)=x\log x}$. With this choice, the dynamics becomes

$\displaystyle \begin{array}{rcl} x'_i(t)=-x_i(t)(\eta l_i(t)+\mu(t), \end{array}$

where ${\mu}$ is such that ${\sum_i x'_i(t)=0}$. This proves the wanted estimate.

5.6. Discrete time analysis

We expect an extra additive error of ${\partial_t^2 D_\phi(y,x(t))}$ which, up to a term in ${\lambda(t)}$, equals

$\displaystyle \begin{array}{rcl} \partial_t \eta l(t)\cdot(y-x(t))&=&-\eta l(t)\cdot x'(t)\\ &=&\eta^2 l(t)\cdot \nabla^2\phi(x(t))^{-1}l(t)\\ &=&\eta^2\|l(t)\|^2_{x(t),*}. \end{array}$

In the last expression, the local norm is defined as follows. The square root of the Hessian of ${\phi}$ defines a Riemannian metric on ${K}$ (which blows up near the boundary).

Theorem 8 Assume that ${\phi}$ is well-conditionned in the sense that if ${\nabla\phi(y_t)\in[\nabla\phi(x_t)-\eta l_t,\nabla\phi(x_t)]}$, then

$\displaystyle \begin{array}{rcl} \nabla^2\phi(y_t)\geq c\nabla^2\phi(x(t)). \end{array}$

Then the regret is bounded above by

$\displaystyle \begin{array}{rcl} \sum_t l_t\cdot(x_t -y)\leq\frac{1}{\eta}D_\phi(y,x_1)+\frac{2\eta}{c}\sum_t \|l_t\|^2_{x_t,*}. \end{array}$

Proof. Recall that ${\nabla\phi(z_{t+1})=\nabla\phi(x_t)-\eta l_t}$ and

$\displaystyle \begin{array}{rcl} x_{t+1}=\mathrm{arg}\min_{x\in K}D_\phi(x,z_{t+1}). \end{array}$

We must estimate

$\displaystyle \begin{array}{rcl} \eta l_t\cdot(x_t-y)&=&\nabla\phi(x_t)\cdot(x_t-y)-\nabla\phi(z_{t+1})\cdot (x_t-x_{t+1})-\nabla\phi(z_{t+1})\cdot(x_{t+1}-y). \end{array}$

The last term is ${\geq -\nabla\phi(x_{t+1})\cdot(x_{t+1}-y)}$ by optimality. On the other hand,

$\displaystyle \begin{array}{rcl} D_\phi(y,x_{t})-D_\phi(y,x_{t+1}) &=&\phi(x_t)+\phi(x_{t+1})-\nabla\phi(x_t)\cdot(y-x(t))+\nabla\phi(x_{t+1})\cdot(y-x_{t+1}). \end{array}$

Hence, using convexity of ${\phi}$,

$\displaystyle \begin{array}{rcl} \eta l_t\cdot(x_t-y)&\leq&D_\phi(y,x_t)-D_\phi(y,x_{t+1})+D_\phi(x_t,z_{t+1}). \end{array}$

By the magic of Fenchel transform,

$\displaystyle \begin{array}{rcl} D_\phi(x_t,z_{t+1})=D_{\phi^*}(\nabla\phi(z_{t+1}),\nabla\phi(x_t)). \end{array}$

In turn, this last term is of the order of ${\|\eta l_t\|^2_{x_t,*}}$. Summing up yields the Theorem.

5.7. Revisiting multiplicative weights with mirror descent

Take ${\phi(x)=\sum x_i\log x_i}$, whose Hessian is the diagonal quadratic form with eigenvalues ${1/x_i}$. The Riemannian metric is ${\|h\|^2_{x,*}=\sum_i x_i h_i^2}$. Then

$\displaystyle \begin{array}{rcl} \nabla\phi(x_t)-\eta l_t=\nabla\phi(z_{t+1})\iff z_{i,t+1}=x_{i,t}e^{-\eta l_{i,t}}. \end{array}$

Hence

$\displaystyle \begin{array}{rcl} x_{i,t+1}=\frac{x_{i,t}e^{-\eta l_{i,t}}}{\|x_{i,t}e^{-\eta l_{i,t}}\|_1}. \end{array}$

Corollary 9 If ${l_{i,t}\in[0,1]}$, then well-conditionning holds with ${c=1}$.

The Theorem yields

$\displaystyle \begin{array}{rcl} \sum_t l_t\cdot(x_t-y)\leq \frac{1}{\eta}D_\phi(y,x_1) \end{array}$

5.8. Propensity score estimate for bandits

Again, we do not observe the full loss function, and need an unbiaised estimator for it. We look for ${\hat l_t\in{\mathbb R}^n}$ depending on ${l_t(i_t)}$ where ${i_t}$ is pick according to ${p_t}$. Start with putting to ${0}$ everything which has not been observed,

$\displaystyle \begin{array}{rcl} \hat l_t(i)=\frac{l_t(i)1_{\{i=i_t\}}}{p_t(i)}. \end{array}$

Then run mirror descent with ${\hat l_t}$ instead of ${l_t}$, get the Exp3 algorithm of Auer-Cesa-Bianchi-Freund-Schapiro 1995).

Theorem 10 Exp3 satisfies

$\displaystyle \begin{array}{rcl} R_T \leq \frac{\log(\eta)}{\eta}+\eta nT\leq \sqrt{Tn\log n}. \end{array}$

Proof.

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}\langle p_t,\hat l_t^2 \rangle &\leq&\mathop{\mathbb E}\sum_i p_t(i)\frac{1_{\{i=i_t\}}}{p_t(i)^2}\\ &=&\mathop{\mathbb E}\sum_i\frac{1_{\{i=i_t\}}}{p_t(i)}=n. \end{array}$

Note that the variance is finite only beause local norms are used.

5.9. More refined regularization

Change function ${\phi}$ to make variance smaller. Negentropy gives ${\nabla^2 \phi(x)^{-1}=diag(x_i)}$. Instead, ${\phi(x)=\sum x_i^{1-\epsilon}}$ gives ${\nabla^2 \phi(x)^{-1}=diag(x_i^{1=\epsilon})}$. Then

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}\sum_i p_t(i)^{1+\epsilon}\frac{1_{\{i=i_t\}}}{p_t(i)^2} =\sum_i p_t(i)^\epsilon\leq n^{1-\epsilon}, \end{array}$

which is slightly better. On the other hand,

$\displaystyle \begin{array}{rcl} -\phi(x)=\sum_i x_i^{1-\epsilon}\leq n^\epsilon. \end{array}$

Plugging this in the Theorem gives regret bound ${R_T\leq\sqrt{nT}}$ (Audibert-Bubeck 2009).

In view of easy data bounds, pick ${\phi(x)=-\sum_i \log x_i}$ (called the log barrier). Then ${\nabla^2 \phi(x)^{-1}=diag(x_i^2)}$. The local norms cancel, yielding

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}\|\hat l_t\|^2_{p_t,*}=\langle p_t,l_t^2 \rangle \leq \langle p_t,l_t \rangle. \end{array}$

Plugged in the Theorem, this gives regret bound ${R_T\leq\sqrt{nL_T^* \log T}}$.

Based on Borodin-Linial-Saks 1982. One moves among states, with a task attached to each state. At each time step, observe ${l_t:[n]\rightarrow {\mathbb R}_+}$. Update state ${i_t\in[n]}$ to ${i_{t+1}}$. Pay movement plus task, ${d(i_t,i_{t+1})+l_t(i_{t+1})}$.

It is hopeless to try to do as well as the best guy. The ibjective is to do it up to a multiplicative constant. So he regret is replaced with a competitive ratio

$\displaystyle \begin{array}{rcl} \frac{\sum_t d(i_t,i_{t+1})+l_t(i_{t+1})}{\sum_t d(i^*_t,i^*_{t+1})+l_t(i^*_{t+1})}, \end{array}$

where ${i^*_1,\ldots,i^*_T}$ are the optimizers.

If ${i_t}$ is picked according to distribution ${p_t}$, then ${\mathop{\mathbb E} d(i_t,i_{t+1})=W_1(p_t,p_{t+1})}$ is Wasserstein distance, and ${\mathop{\mathbb E}(l_t(i_{t+1}))=\langle p_{t+1},l_t \rangle}$ is a one look ahead term.

Example. If all distances are equal to 1, ${W_1}$ is the ${\ell^1}$ distance on distributions.

Our convex function ${\phi}$ should be ${L}$-Lipschitz with respect to the Wasserstein distance.

Theorem 11 (Bubeck-Cohen-Lee… 2018) The competitive ratio of mirror descent is bounded above by ${L}$ times the terme arising from mouvement cost analysis.

Theorem 12 (Borodin-Linial-Saks 1982) For the trivial distance, the competitive ratio is at least ${\log n}$.

Conjecture. This is true for any metric.

My intuition is that the metric can only help us. Furthermore, there is an avatar of entropy, called multiscale entropy, associated with any metric. Usual entropy is well suited only to the trivial distance.