**Bandit regret optimization**

Blackboard course. Slides available on the PGMO website. Video available in a few weeks.

**1. Introduction to regret analysis**

**1.1. Framework**

At each time step $t = 1, \dots, T$, the player selects an action $i_t \in \{1,\dots,n\}$ at random, according to a distribution $p_t$; the sampling randomness is independent of the past. Simultaneously, the adversary selects a loss $\ell_t \in [0,1]^n$. The loss suffered by the player is $\ell_t(i_t)$. Full information would mean that $\ell_t$ is observed. Bandit means that only $\ell_t(i_t)$ is observed.

The benchmark, to which other strategies will be compared, is the best fixed action in hindsight,
$$L_T^* = \min_{i \in \{1,\dots,n\}} \sum_{t=1}^T \ell_t(i).$$

My loss $\sum_{t=1}^T \ell_t(i_t)$ must be compared with the optimal loss $L_T^*$.

The regret is the difference
$$R_T = \mathbb{E}\Big[\sum_{t=1}^T \ell_t(i_t)\Big] - L_T^*.$$
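In code, the definition of regret reads as follows (a minimal sketch; the loss table `losses` and the plays are made up for illustration):

```python
# regret = cumulative loss of the player minus loss of the best fixed action.
# losses[t][i] = loss of action i at round t; plays[t] = action chosen at round t.

def regret(losses, plays):
    n = len(losses[0])
    player_loss = sum(l[i] for l, i in zip(losses, plays))
    best_fixed = min(sum(l[i] for l in losses) for i in range(n))
    return player_loss - best_fixed

losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
plays = [1, 1, 0]
print(regret(losses, plays))  # player pays 1.0; best fixed action (0) also pays 1.0
```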

The machine learning mindset is that a single picture does not help much, but a large dataset of actions allows for a good strategy. So the action set is large, possibly with some structure.

Think of an oblivious adversary, like nature. Adversary merely means that we cannot make any stochastic assumption about it. For instance, $\ell_t$ does not represent what a clever adversary would have done had the player applied a constant strategy.

Interpret $R_T/T$ as the rate at which the best action is identified. How large should one take $T$ in order that $R_T/T \le \varepsilon$?

If there is a good action, we intend to find it. If there is none,…

**1.2. Some applications**

Ad placement. The player runs a website. Action = choose an ad. Adversary is a potential customer, who clicks on the ad or not. The best action is the ad most likely to be clicked.

Packet routing. Graph with source and target nodes. Action = choose a path. Huge set of actions, but with a lot of structure (distance between paths).

Hyperparameter optimization.

AI for games, i.e. efficient tree search. In AlphaGo, a deterministic game, randomness was introduced to improve exploration of the set of actions.

**1.3. Overview of results**

\textbf{Full information with finite actions} There, $R_T \asymp \sqrt{T\log n}$:

- $R_T \gtrsim \sqrt{T\log n}$ for pure noise (the case when the past gives no information).
- There exists an algorithm which achieves $R_T \lesssim \sqrt{T\log n}$.

I will give two proofs, an algorithmic one, and a more conceptual one where $\log n$ appears as an entropy.

\textbf{Easy data} I.e. when there is more signal in the data than pure noise. Then $R_T \lesssim \sqrt{L_T^*\,\log n}$, where $L_T^*$ is the loss of the best action.

\textbf{Bandit with finite action set} Then $R_T \asymp \sqrt{Tn}$ up to logarithmic factors. Bad news: in the absence of structure, the strategy is slow.

\textbf{Large and structured action set} This uses deeper mathematics (Bernoulli convolution, geometry of log-concave measures).

\textbf{Better benchmarks}

**2. Full information**

**2.1. An algorithm: multiplicative weights**

Rediscovered many times. See Vovk 1990, Freund-Schapire 1997, Littlestone-Warmuth 1994.

Assume an oblivious adversary, and losses $\ell_t \in [0,1]^n$.

Idea: keep a weight $w_t(i)$ for each action $i$. Update
$$w_{t+1}(i) = w_t(i)\,e^{-\eta\,\ell_t(i)}.$$

Play $i_t$ from $p_t = w_t \big/ \sum_i w_t(i)$.

Key intuition: say the losses are binary. If action $i$ is correct at round $t$ (i.e. $\ell_t(i) = 0$), then $w_t(i)$ remains and the weights of incorrect actions shrink, so $p_t$ gets closer to $\delta_i$. Otherwise, $i$ pays for it a fixed factor $e^{-\eta}$.
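A minimal sketch of the multiplicative weights update (the loss sequence and the learning rate `eta` are illustrative):

```python
import math

def multiplicative_weights(losses, eta):
    """w_{t+1}(i) = w_t(i) * exp(-eta * loss_t(i)); play from p_t ~ w_t."""
    n = len(losses[0])
    w = [1.0] * n
    dists = []
    for loss in losses:
        total = sum(w)
        dists.append([wi / total for wi in w])      # p_t proportional to w_t
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, loss)]
    return dists

# Toy run: action 0 is consistently good, so p_t(0) grows toward 1.
losses = [[0.0, 1.0]] * 50
dists = multiplicative_weights(losses, eta=0.3)
print(dists[0], dists[-1])
```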

Theorem 1. For every $\eta > 0$,
$$\sum_{t=1}^T \langle p_t, \ell_t\rangle - \min_i \sum_{t=1}^T \ell_t(i) \le \frac{\log n}{\eta} + \frac{\eta\,T}{2}.$$

Optimizing in $\eta$ (take $\eta = \sqrt{2\log n / T}$) yields
$$R_T \le \sqrt{2\,T\log n}.$$

**Proof** Liapunov function based argument. Consider $\Phi_t = \log \sum_i w_t(i)$. Then
$$\Phi_{t+1} - \Phi_t = \log \sum_i p_t(i)\,e^{-\eta\,\ell_t(i)},$$

hence
$$\Phi_{t+1} - \Phi_t \le \log\Big(1 - \eta\,\langle p_t, \ell_t\rangle + \frac{\eta^2}{2}\Big) \le -\eta\,\langle p_t, \ell_t\rangle + \frac{\eta^2}{2},$$

since $e^{-x} \le 1 - x + \frac{x^2}{2}$ for $x \ge 0$. Note that $\log(1+x) \le x$. On the other hand,
$$\Phi_{T+1} \ge \log w_{T+1}(i^*) = -\eta \sum_{t=1}^T \ell_t(i^*), \qquad \Phi_1 = \log n,$$

hence
$$\sum_{t=1}^T \langle p_t, \ell_t\rangle - \sum_{t=1}^T \ell_t(i^*) \le \frac{\log n}{\eta} + \frac{\eta\,T}{2},$$

as announced. End of proof.

**2.2. Lower bound (pure noise scenario)**

Theorem 2. For every algorithm, there exist numbers $\ell_t(i) \in [0,1]$ such that
$$R_T \ge c\,\sqrt{T\log n}.$$

**Proof**. Probabilistic method. A lower bound on expected regret (with random losses) is given.

Pick the losses at random: $\ell_t(i) \in \{0,1\}$ uniformly, independently (a shifted Rademacher distribution). Then, whatever the algorithm, the expected loss of the player is $T/2$, while
$$\mathbb{E}\Big[\min_i \sum_{t=1}^T \ell_t(i)\Big] \approx \frac{T}{2} - c\,\sqrt{T\log n}$$

according to the central limit theorem. Indeed, the max of $n$ independent centered Gaussians of standard deviation $\sqrt{T}/2$ has size $\sqrt{T\log n}$. End of proof.
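The pure-noise gap can be illustrated by a quick Monte Carlo experiment (all parameters are made up):

```python
import random

# With i.i.d. fair-coin losses, any player's expected loss is T/2, while the
# best fixed action in hindsight does better by roughly sqrt(T log n).
random.seed(0)
n, T, trials = 16, 400, 200
gap = 0.0
for _ in range(trials):
    totals = [sum(random.randint(0, 1) for _ in range(T)) for _ in range(n)]
    gap += T / 2 - min(totals)
gap /= trials
print(gap)  # a positive gap of order sqrt(T log n)
```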

**2.3. A more principled approach**

Based on Abernethy-Warmuth 2008, Bubeck-Dekel-Koren-Peres 2015.

Say the player picks $i_t \in \{1,\dots,n\}$ and the adversary picks $\ell_t \in [0,1]^n$.

A deterministic strategy is given by a map $s : \bigcup_{t\ge 0}\,([0,1]^n)^t \to \{1,\dots,n\}$, mapping past losses to the next action.

A bandit deterministic strategy is a map $s : \bigcup_{t\ge 0}\,[0,1]^t \to \{1,\dots,n\}$, since only the suffered losses are observed.

A randomized strategy is picking at random a deterministic strategy $s \in S$, where $S$ is the set of deterministic strategies. Kuhn's theorem asserts that such mixed strategies are equivalent to behavioral strategies (fresh randomization at every round).

The minimax regret consists in finding the best strategy, achieving
$$\inf_{\text{strategies}}\ \sup_{\text{adversaries}}\ R_T.$$

Sion's minimax theorem allows to swap inf and sup: let $f(x,y)$ be convex in $x$, concave in $y$, bicontinuous, with $X$ compact. Then
$$\inf_{x\in X}\,\sup_y\, f(x,y) = \sup_y\,\inf_{x\in X}\, f(x,y).$$

Now we are in a Bayesian setting, with a prior distribution on the adversary's loss sequence $(\ell_1,\dots,\ell_T)$. In particular, we know the distribution of the best action $i^*$.

Denote by $p_t(i) = \mathbb{P}(i^* = i \mid \ell_1,\dots,\ell_{t-1})$. By convexity of the simplex, we choose to play $p_t$, a Doob martingale. In other words, $p_t = \mathbb{E}[\delta_{i^*} \mid \ell_1,\dots,\ell_{t-1}]$. Let us analyze it.

Each move of a martingale costs entropy. This is encoded in the following folklore inequality,
$$\sum_{t=1}^T \mathbb{E}\,\|p_{t+1} - p_t\|_1^2 \le 2\log n.$$

Assuming this bound, use Cauchy-Schwarz,
$$R_T \le \sum_{t=1}^T \mathbb{E}\,\|p_{t+1} - p_t\|_1 \le \sqrt{T \sum_{t=1}^T \mathbb{E}\,\|p_{t+1} - p_t\|_1^2} \le \sqrt{2\,T\log n},$$

which is the optimal bound. End of proof.

The bandit case will be more subtle, since one will not be allowed to condition with respect to the full loss vector $\ell_t$.

**2.4. Proof of the folklore inequality**

Recall the entropy of a finite distribution $p$: $H(p) = -\sum_i p(i)\log p(i)$. The relative entropy is
$$D(p\,\|\,q) = \sum_i p(i)\,\log\frac{p(i)}{q(i)}.$$

Given finite random variables $X$ and $Y$, the entropy of $X$ knowing $Y$ is
$$H(X\,|\,Y) = \mathbb{E}_Y\big[H(\text{law of } X \text{ given } Y)\big].$$

The mutual information measures how useful $Y$ is when predicting $X$,
$$I(X;Y) = H(X) - H(X\,|\,Y).$$

It is symmetric. Pinsker's inequality states that
$$\|p - q\|_1^2 \le 2\,D(p\,\|\,q).$$

**Proof of the folklore inequality**. Apply Pinsker's inequality,
$$\mathbb{E}\,\|p_{t+1} - p_t\|_1^2 \le 2\,\mathbb{E}\,D(p_{t+1}\,\|\,p_t).$$

Since mutual information is the expected relative entropy between posterior and prior,
$$\mathbb{E}\,D(p_{t+1}\,\|\,p_t) = I(i^*;\,\ell_t \mid \ell_1,\dots,\ell_{t-1}).$$

Hence, by the chain rule,
$$\sum_{t=1}^T \mathbb{E}\,\|p_{t+1} - p_t\|_1^2 \le 2\sum_{t=1}^T I(i^*;\,\ell_t \mid \ell_1,\dots,\ell_{t-1}) \le 2\,H(i^*) \le 2\log n.$$
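Pinsker's inequality is easy to check numerically (an illustration, not a proof):

```python
import math
import random

def l1(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Sanity check of Pinsker's inequality ||p - q||_1^2 <= 2 D(p||q)
# on random pairs of distributions on 4 points.
random.seed(1)
for _ in range(1000):
    p = [random.random() + 1e-9 for _ in range(4)]
    q = [random.random() + 1e-9 for _ in range(4)]
    p = [x / sum(p) for x in p]
    q = [x / sum(q) for x in q]
    assert l1(p, q) ** 2 <= 2 * kl(p, q) + 1e-12
print("Pinsker holds on 1000 random pairs")
```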

**3. Bandit analysis**

Based on Russo-Van Roy 2014.

Back to the instantaneous regret $r_t = \mathbb{E}\big[\langle p_t, \ell_t\rangle - \ell_t(i^*)\big]$. We used
$$\mathbb{E}\big[\ell_t(i^*) \mid \ell_1,\dots,\ell_t\big] = \langle p_{t+1}, \ell_t\rangle.$$

This is not true any more in the bandit setting, where the filtration is much smaller. Nevertheless, we hope that a weaker inequality would suffice: a bound of $r_t$ in terms of the information $I_t$ gained at round $t$.

The next argument goes back to the very first paper on bandit theory, Thompson Sampling 1933. Bandit is a trade off between exploration and exploitation. For future benefit, it is necessary to explore more.

Denote $\bar\ell_t(i) = \mathbb{E}\big[\ell_t(i) \mid \mathcal F_{t-1}\big]$, $\bar\ell_t(i,j) = \mathbb{E}\big[\ell_t(i) \mid \mathcal F_{t-1},\, i^* = j\big]$. Then
$$r_t = \sum_i p_t(i)\,\big(\bar\ell_t(i) - \bar\ell_t(i,i)\big).$$

On the other hand,
$$I_t \ge \frac12 \sum_{i,j} p_t(i)\,p_t(j)\,\big(\bar\ell_t(i,j) - \bar\ell_t(i)\big)^2,$$

where the last step is Pinsker's inequality, plus the fact that $p_t$ is both the law of $i^*$ and the distribution of the played action (Thompson sampling). Cauchy-Schwarz inequality gives
$$r_t \le \sqrt{n \sum_i p_t(i)^2\,\big(\bar\ell_t(i,i) - \bar\ell_t(i)\big)^2} \le \sqrt{2\,n\,I_t}.$$

Plugging this in the argument of the previous section, one gets
$$R_T = \sum_t r_t \le \sqrt{2\,n\,T \sum_t I_t} \le \sqrt{2\,n\,T\log n}.$$

Theorem 3. In the bandit setting,
$$R_T \le \sqrt{2\,T\,n\,\log n}.$$

The $\sqrt{\log n}$ factor can be removed.

On the lower bound side, the adversary is picked at random: one arm has expected loss $\frac12 - \varepsilon$, the others $\frac12$. One needs $1/\varepsilon^2$ samples per arm to detect it, whence $T \approx n/\varepsilon^2$, and the regret is $\varepsilon\,T \approx \sqrt{nT}$.

**4. A primer on bandit linear optimization**

Now the action set is $A \subset \mathbb{R}^n$, finite, and the loss of playing $a \in A$ is the inner product $\langle a, \ell_t\rangle$. Naively, treating each point of $A$ as an arm, one would expect a regret of order $\sqrt{T\,|A|\,\log|A|}$.

Theorem 4.
$$R_T \lesssim \sqrt{n\,T\,\log|A|}.$$

This is best possible.

If $A$ is infinite but included in the unit ball of $\mathbb{R}^n$ for some norm, a dimension-dependent bound survives.

Theorem 5. The regret admits a bound of order $\sqrt{T}$ with a constant depending on $n$ and $q$,

where $q$ is the cotype of the chosen norm on $\mathbb{R}^n$.

**4.1. Proof of Theorem 4**

The last application of Cauchy-Schwarz needs to be replaced by something smarter. Here
$$r_t = \sum_{a \in A} p_t(a)\,\big(\bar\ell_t(a) - \bar\ell_t(a,a)\big).$$

Introduce the (huge) matrix
$$M_{a,b} = \sqrt{p_t(a)\,p_t(b)}\,\big(\bar\ell_t(a) - \bar\ell_t(a,b)\big), \qquad a, b \in A.$$

One needs to relate the trace of this matrix to its Frobenius norm. In general,
$$\operatorname{tr}(M) \le \sqrt{\operatorname{rank}(M)}\;\|M\|_F.$$

Thus we win by replacing the size $|A|$ with the rank of the matrix $M$. Here, by linearity of the losses, $M = B^\top C$ where $B$ and $C$ have size $n \times |A|$, hence $M$ has rank at most $n$.
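The linear-algebra step can be checked numerically on random low-rank matrices (sizes are illustrative):

```python
import numpy as np

# Check tr(M) <= sqrt(rank(M)) * ||M||_F on random matrices of rank at most 3.
rng = np.random.default_rng(0)
for _ in range(100):
    B = rng.standard_normal((3, 20))
    C = rng.standard_normal((3, 20))
    M = B.T @ C                        # 20 x 20 matrix, rank <= 3
    r = np.linalg.matrix_rank(M)
    assert np.trace(M) <= np.sqrt(r) * np.linalg.norm(M, "fro") + 1e-9
print("trace-rank-Frobenius inequality verified")
```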

**5. Mirror descent and online decision making**

We focus on online linear optimization.

Data: a compact convex body $K \subset \mathbb{R}^n$. At each time step $t$, the player picks $x_t \in K$ and the adversary picks $\ell_t \in \mathbb{R}^n$ (previously, $K$ was the simplex) such that $|\langle \ell_t, x\rangle| \le 1$ for all $x \in K$. My regret is
$$R_T = \sum_{t=1}^T \langle \ell_t, x_t\rangle - \min_{x \in K} \sum_{t=1}^T \langle \ell_t, x\rangle.$$

**5.1. Stability as an algorithm design principle**

We estimate
$$R_T = \sum_t \langle \ell_t,\ x_t - x_{t+1}\rangle + \Big(\sum_t \langle \ell_t,\ x_{t+1}\rangle - \min_{x\in K}\sum_t \langle \ell_t,\ x\rangle\Big).$$

The first term is a movement cost. Call the second term the one look ahead regret. The algorithm will find a trade-off between the two terms.

**5.2. Gradient descent**

This is $x_{t+1} = x_t - \eta\,\ell_t$. In a more invariant manner,
$$x_{t+1} = \arg\min_x\ \eta\,\langle \ell_t, x\rangle + \frac12\,\|x - x_t\|_2^2.$$

Note the Euclidean norm is not well suited to our problem, which involves an $\ell_\infty$ bound on losses and an $\ell_1$ constraint on actions. Assuming that $x_1 = 0$, a third interpretation, in the spirit of regularization, is
$$x_{t+1} = \arg\min_x\ \eta\,\Big\langle \sum_{s \le t} \ell_s,\ x\Big\rangle + \frac12\,\|x\|_2^2.$$

We try to avoid overfitting, i.e. control entropy. Hence we replace the squared norm with entropy. Also, instead of the inner product, duality should be used.

**5.3. Mirror descent**

Initially due to Nemirovski; see the book by Nemirovski and Yudin, 1983.

Let $\Phi$ be a convex function on $K$. Write $\nabla\Phi$ for its gradient, with values in the dual space.

Introduce the Fenchel dual function on the dual space,
$$\Phi^*(\theta) = \sup_x\ \langle \theta, x\rangle - \Phi(x).$$

Under mild assumptions, $\nabla\Phi^* = (\nabla\Phi)^{-1}$.

Given $x_t$, $\nabla\Phi(x_t)$ belongs to the dual space. Move it in the direction of $-\ell_t$, get $\nabla\Phi(x_t) - \eta\,\ell_t$. Get back to the primal space by applying $\nabla\Phi^*$, get
$$y_{t+1} = \nabla\Phi^*\big(\nabla\Phi(x_t) - \eta\,\ell_t\big).$$

Maybe this does not belong to $K$. Thus project it to $K$ (with respect to the Bregman divergence $D_\Phi(x,y) = \Phi(x) - \Phi(y) - \langle \nabla\Phi(y), x - y\rangle$). In other words,
$$x_{t+1} = \arg\min_{x \in K}\ \eta\,\langle \ell_t, x\rangle + D_\Phi(x, x_t).$$
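As a concrete instance (choosing $\Phi(x) = \sum_i x_i\log x_i$ on the simplex, for which the Bregman projection is a plain normalization, a standard fact), one step can be sketched as:

```python
import numpy as np

def md_step_negentropy(x, loss, eta):
    """One mirror descent step on the simplex for Phi(x) = sum_i x_i log x_i.
    grad Phi(x) = 1 + log x, so moving in the dual by -eta*loss and mapping
    back multiplies each coordinate by exp(-eta * loss_i); normalizing is
    the Bregman projection onto the simplex."""
    y = x * np.exp(-eta * loss)
    return y / y.sum()

x = np.ones(3) / 3
x = md_step_negentropy(x, np.array([1.0, 0.0, 0.0]), eta=0.5)
print(x)  # mass moves away from the high-loss coordinate
```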

**5.4. Analysis of the one look ahead term**

For intuition, I explain the continuous setting. Then time is continuous, and $x(t) \in K$ is defined using $\ell(s)$ for all $s < t$. Hence the one look ahead term becomes
$$\int_0^T \langle \ell(t), x(t)\rangle\,dt - \min_{x\in K}\int_0^T \langle \ell(t), x\rangle\,dt,$$

whereas the movement cost becomes $-\int_0^T \langle \ell(t), \dot x(t)\rangle\,dt$. The continuous time mirror descent is
$$x(t+dt) = \arg\min_{x\in K}\ \eta\,\langle \ell(t)\,dt,\ x\rangle + D_\Phi(x, x(t)).$$

The constrained optimality condition at a point $x \in K$ involves covectors in the normal cone
$$N_K(x) = \big\{\theta :\ \langle \theta,\ y - x\rangle \le 0\ \ \forall y \in K\big\}.$$

It states that the gradient of the objective at the minimizer belongs to $-N_K(x)$. The gradient of $x \mapsto \eta\,\langle \ell(t)\,dt, x\rangle + D_\Phi(x, x(t))$ is $\eta\,\ell(t)\,dt + \nabla\Phi(x) - \nabla\Phi(x(t))$. Hence the minimizer satisfies
$$\nabla\Phi(x(t+dt)) - \nabla\Phi(x(t)) \in -\eta\,\ell(t)\,dt - N_K(x(t+dt)),$$

hence the differential inclusion
$$\frac{d}{dt}\,\nabla\Phi(x(t)) \in -\eta\,\ell(t) - N_K(x(t)).$$

Theorem 6 (Bubeck-Cohen-Lee…2017). If $\ell$ is continuous, $x \mapsto N_K(x)$ is semicontinuous and $\nabla^2\Phi$ is continuous, the differential inclusion has a unique solution.

As a Liapunov function, we use the Bregman divergence to the comparator $x^*$,
$$V(t) = D_\Phi(x^*, x(t)).$$

Let us compute the time derivative
$$\dot V(t) = -\Big\langle \frac{d}{dt}\nabla\Phi(x(t)),\ x^* - x(t)\Big\rangle = \big\langle \eta\,\ell(t) + \lambda(t),\ x^* - x(t)\big\rangle \le \eta\,\langle \ell(t),\ x^* - x(t)\rangle,$$

since the Lagrange multiplier $\lambda(t) \in N_K(x(t))$ is nonpositive against directions pointing into $K$. Integrating over time, we get

Lemma 7.
$$\int_0^T \langle \ell(t),\ x(t) - x^*\rangle\,dt \le \frac{D_\Phi(x^*, x(0))}{\eta}.$$

Recall the multiplicative weights bound $R_T \le \frac{\log n}{\eta} + \frac{\eta T}{2}$.

We have obtained the estimate corresponding to the term $\frac{\log n}{\eta}$. Next, we move to the other term.

**5.5. Movement analysis**

Nothing general here: $K$ is the simplex equipped with the $\ell_1$ norm. We hope for an estimate
$$\langle \ell(t),\ -\dot x(t)\rangle \le \eta.$$

A stronger estimate would be
$$\|\dot x(t)\|_1 \le \eta.$$

Here, $\dot x = (\nabla^2\Phi(x))^{-1}\big(-\eta\,\ell - \lambda\big)$. We may choose a suitable convex function $\Phi$. To simplify matters, let us ignore $\lambda$ and assume $\ell(t) \in [0,1]^n$. Then
$$\|\dot x\|_1 = \eta\,\big\|(\nabla^2\Phi(x))^{-1}\ell\big\|_1,$$

which is indeed estimated by $\eta$ provided we choose the negentropy $\Phi(x) = \sum_i x_i\log x_i$, for which $(\nabla^2\Phi(x))^{-1} = \mathrm{diag}(x_i)$. With this choice, the dynamics becomes
$$\dot x_i = -\eta\,x_i\,\big(\ell_i(t) - \mu(t)\big),$$

where $\mu(t)$ is such that $\sum_i \dot x_i = 0$. This proves the wanted estimate.
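A crude Euler discretization of this dynamics (step size and horizon are arbitrary) illustrates that the simplex is preserved and that mass leaves the high-loss coordinate:

```python
import numpy as np

# Euler discretization of x_i' = -eta * x_i * (l_i - mu) with mu = <x, l>,
# so that sum_i x_i' = 0 and the simplex is preserved.
def simulate(loss, eta=0.5, dt=0.01, steps=1000):
    x = np.ones(len(loss)) / len(loss)
    for _ in range(steps):
        mu = x @ loss
        x = x - dt * eta * x * (loss - mu)
    return x

x = simulate(np.array([1.0, 0.0, 0.0]))
print(x)  # the first (lossy) coordinate has almost vanished
```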

**5.6. Discrete time analysis**

We expect an extra additive error of $\langle \ell_t,\ x_t - x_{t+1}\rangle$ per step which, up to a term in $\eta^2$, equals
$$\eta\,\|\ell_t\|_{x_t,*}^2 = \eta\,\big\langle \ell_t,\ (\nabla^2\Phi(x_t))^{-1}\ell_t\big\rangle.$$

In the last expression, the local norm is defined as follows. The square root of the Hessian of $\Phi$ defines a Riemannian metric on the interior of $K$ (which blows up near the boundary): $\|h\|_x = \sqrt{\langle \nabla^2\Phi(x)h,\ h\rangle}$, with dual norm $\|\ell\|_{x,*} = \sqrt{\langle \ell,\ (\nabla^2\Phi(x))^{-1}\ell\rangle}$.

Theorem 8. Assume that $\Phi$ is well-conditioned in the sense that if $\|y - x\|_x \le 1$, then
$$\nabla^2\Phi(y) \succeq \frac{1}{c}\,\nabla^2\Phi(x).$$

Then the regret is bounded above by
$$R_T \le \frac{D_\Phi(x^*, x_1)}{\eta} + c\,\eta \sum_{t=1}^T \|\ell_t\|_{x_t,*}^2.$$

**Proof**. Recall that $\nabla\Phi(y_{t+1}) = \nabla\Phi(x_t) - \eta\,\ell_t$ and that $x_{t+1}$ is the Bregman projection of $y_{t+1}$ onto $K$.

We must estimate
$$\sum_t \langle \ell_t,\ x_t - x^*\rangle = \sum_t \langle \ell_t,\ x_t - x_{t+1}\rangle + \sum_t \langle \ell_t,\ x_{t+1} - x^*\rangle.$$

The last term is at most $\frac{D_\Phi(x^*, x_1)}{\eta} - \frac{1}{\eta}\sum_t D_\Phi(x_{t+1}, x_t)$ by optimality (the discrete analogue of Lemma 7). On the other hand, using convexity of $\Phi$ and well-conditioning,
$$D_\Phi(x_{t+1}, x_t) \ge \frac{1}{2c}\,\|x_{t+1} - x_t\|_{x_t}^2.$$

Hence, by the magic of the Fenchel transform,
$$\langle \ell_t,\ x_t - x_{t+1}\rangle - \frac{1}{\eta}\,D_\Phi(x_{t+1}, x_t) \le \frac{1}{\eta}\,\sup_h\Big(\eta\,\langle \ell_t, h\rangle - \frac{1}{2c}\,\|h\|_{x_t}^2\Big) = \frac{c\,\eta}{2}\,\|\ell_t\|_{x_t,*}^2.$$

In turn, summing this last term over $t$ yields the Theorem. End of proof.

**5.7. Revisiting multiplicative weights with mirror descent**

Take $\Phi(x) = \sum_i x_i\log x_i$, whose Hessian is the diagonal quadratic form with eigenvalues $1/x_i$. The Riemannian metric is $ds^2 = \sum_i dx_i^2/x_i$. Then
$$\|\ell\|_{x,*}^2 = \sum_i x_i\,\ell_i^2.$$

Hence, for $\ell \in [0,1]^n$,
$$\|\ell\|_{x,*}^2 \le \langle x, \ell\rangle \le 1.$$

Corollary 9. If $\eta \le 1/2$, then well-conditioning holds with $c$ an absolute constant.

The Theorem yields
$$R_T \le \frac{\log n}{\eta} + c\,\eta\,T \lesssim \sqrt{T\log n}.$$

**5.8. Propensity score estimate for bandits**

Again, we do not observe the full loss function, and need an unbiased estimator for it. We look for $\tilde\ell_t$ depending only on the observed value $\ell_t(i_t)$, where $i_t$ is picked according to $p_t$. Start with putting $0$ to everything which has not been observed, then reweight to restore unbiasedness,
$$\tilde\ell_t(i) = \frac{\ell_t(i)}{p_t(i)}\,\mathbf{1}_{i = i_t}, \qquad \mathbb{E}_{i_t \sim p_t}\big[\tilde\ell_t(i)\big] = \ell_t(i).$$

Then run mirror descent with $\tilde\ell_t$ instead of $\ell_t$; we get the Exp3 algorithm of Auer-Cesa-Bianchi-Freund-Schapire 1995.
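A minimal Exp3 sketch combining the propensity score estimator with multiplicative weights (the loss table and parameters are made up):

```python
import math
import random

def exp3(loss_rows, eta, rng):
    """Play i_t ~ p_t, observe only loss_rows[t][i_t], feed the propensity
    score estimate tilde_l(i) = loss * 1{i = i_t} / p_t(i_t) to
    multiplicative weights. Returns the total suffered loss."""
    n = len(loss_rows[0])
    w = [1.0] * n
    total_loss = 0.0
    for loss in loss_rows:
        s = sum(w)
        p = [wi / s for wi in w]
        i = rng.choices(range(n), weights=p)[0]
        total_loss += loss[i]
        tilde = loss[i] / p[i]           # importance weighting: unbiased estimate
        w[i] *= math.exp(-eta * tilde)   # only the played arm is updated
    return total_loss

rng = random.Random(0)
losses = [[0.1, 0.9]] * 300
print(exp3(losses, eta=0.05, rng=rng))
```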

Theorem 10. Exp3 satisfies $R_T \lesssim \sqrt{T\,n\,\log n}$.

**Proof**. Apply the mirror descent analysis with negentropy to the estimated losses: $\mathbb{E}\,\|\tilde\ell_t\|_{x_t,*}^2 = \sum_i \ell_t(i)^2 \le n$, so the regret is at most $\frac{\log n}{\eta} + c\,\eta\,T\,n$, optimized to $\sqrt{T\,n\,\log n}$. End of proof.

Note that the variance is finite only because local norms are used.

**5.9. More refined regularization**

Change the function $\Phi$ to make the variance smaller. Negentropy gives $\mathbb{E}\,\|\tilde\ell_t\|_{x_t,*}^2 \le n$. Instead, $\Phi(x) = -2\sum_i \sqrt{x_i}$ gives $(\nabla^2\Phi(x))^{-1} = \mathrm{diag}(2\,x_i^{3/2})$. Then
$$\mathbb{E}\,\|\tilde\ell_t\|_{x_t,*}^2 = 2\sum_i \sqrt{x_t(i)}\,\ell_t(i)^2 \le 2\sqrt{n},$$

which is slightly better. On the other hand,
$$D_\Phi(x^*, x_1) \le 2\sqrt{n}.$$

Plugging this in the Theorem gives regret bound $O(\sqrt{Tn})$ (Audibert-Bubeck 2009).

In view of easy data bounds, pick $\Phi(x) = -\sum_i \log x_i$ (called the log barrier). Then $(\nabla^2\Phi(x))^{-1} = \mathrm{diag}(x_i^2)$. The local norms cancel against the propensity weights, yielding
$$\mathbb{E}\,\|\tilde\ell_t\|_{x_t,*}^2 = \sum_i x_t(i)\,\ell_t(i)^2 \le \langle x_t, \ell_t\rangle.$$

Plugged in the Theorem, this gives a regret bound of order $\sqrt{n\,L_T^*}$ up to logarithmic factors, where $L_T^*$ is the loss of the best action.
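The variance computations for the three regularizers can be verified by exact enumeration over the played arm $i_t \sim p$ (the distribution `p` and the losses are arbitrary; the inverse Hessians $\mathrm{diag}(x_i)$, $\mathrm{diag}(2x_i^{3/2})$, $\mathrm{diag}(x_i^2)$ are those of negentropy, $-2\sum\sqrt{x_i}$ and the log barrier respectively):

```python
# Exact E ||tilde_l||_{x,*}^2 for the propensity score estimator, computed by
# enumerating the played arm i_t ~ p (no sampling needed).
def expected_local_norm_sq(p, loss, inv_hess):
    total = 0.0
    for i, pi in enumerate(p):          # the played arm is i with prob. pi
        tilde = [0.0] * len(p)
        tilde[i] = loss[i] / pi         # propensity score estimate
        total += pi * sum(h * t * t for h, t in zip(inv_hess(p), tilde))
    return total

p = [0.5, 0.3, 0.2]
loss = [1.0, 1.0, 1.0]
neg = expected_local_norm_sq(p, loss, lambda x: x)                         # sum_i l_i^2
tsallis = expected_local_norm_sq(p, loss, lambda x: [2 * xi**1.5 for xi in x])
logbar = expected_local_norm_sq(p, loss, lambda x: [xi**2 for xi in x])    # <x, l^2>
print(neg, tsallis, logbar)
```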

**5.10. Metrical task systems**

Based on Borodin-Linial-Saks 1992. One moves among $n$ states, with a task attached to each state. At each time step $t$, observe the task $\ell_t$. Update the state to $x_t$. Pay movement plus task, $d(x_{t-1}, x_t) + \ell_t(x_t)$.

It is hopeless to try to do as well as the best guy. The objective is to do it up to a multiplicative constant. So the regret is replaced with a competitive ratio
$$\sup_{\ell}\ \frac{\sum_t d(x_{t-1}, x_t) + \ell_t(x_t)}{\sum_t d(y_{t-1}, y_t) + \ell_t(y_t)},$$

where $(y_t)$ are the optimizers.

If $x_t$ is picked according to a distribution $p_t$, then the movement cost is the Wasserstein distance $W_1(p_{t-1}, p_t)$, and $\langle p_t, \ell_t\rangle$ is a one look ahead term.

**Example**. If all distances are equal to 1, $W_1$ is the total variation distance on distributions.

Our convex function $\Phi$ should be Lipschitz with respect to the Wasserstein distance.

Theorem 11 (Bubeck-Cohen-Lee… 2018). The competitive ratio of mirror descent is bounded above by a constant times the term arising from the movement cost analysis.

Theorem 12 (Borodin-Linial-Saks 1992). For the trivial distance, the competitive ratio is at least of order $\log n$.

**Conjecture**. This is true for any metric.

My intuition is that the metric can only help us. Furthermore, there is an avatar of entropy, called multiscale entropy, associated with any metric. Usual entropy is well suited only to the trivial distance.