# Performance of index policies for markovian bandits

 Labo INRIA Equipe POLARIS Encadrants nicolas.gast@inria.fr

### Context

Markovian Bandits with discounted payoffs are composed of $N$ Bandits, each modeled by a finite Markov chain. The reward provided by using bandit $n$ is $r_n(X)$ when the bandit $n$ is in state $X$. Here, we consider the model of restless bandit in which the bandit changes state according to a transition matrix $P^{(n)}_a$ if it is chosen and $P^{(n)}_b$ if it is not chosen. This generalizes the classical stochastic *i.i.d.* bandits, that always provide rewards with the same distribution.

The optimal selection of the next bandit can be formulated as a Markov Decision Process. If $P^{(n)}_b=I$, the bandit does not change state when it is not chosen. This is called the restful case. In this case, the optimal policy can be computed using Gittins index: the problem can be decomposed into the computation of an index for each bandit. The optimal choice in any configuration is to select the bandit with the smallest index. The computation of the index of each bandit is itself the solution of a Markov decision process (whose size is simply the size of the Markov chain of this bandit), see for example the monograph of [2].

When a bandit can change state even if it is not chosen ($P^{(n)}_b\ne I)$, we fall in the category of restless bandits. This case is much harder to solve as in general index policies are no longer optimal. In [4], a generalization of Gittins index has been proposed and they are known to work very well in practice. In particular, they have been show to be asymptotically optimal as the number of arms go to infinity [3].

### Objectives

In a recent paper [1], two new policies are introduced: an LP relaxation of the problem and a diffusion index policy. This paper claims to obtain policies with proven guarantees ($o(N)$ and $O(\sqrt{N})$ convergence. This paper studies a small example that seem to indicate that the proposed policies work well but the simulations are quite limited. The goal of this project will be to redo the experimental part of [1] with more convincing examples and to compare the performance of the performance of Whittle index policies with those policies.

To achieve this, the work to be done will be:

- Familiarize with the notion of Markovian bandit (restless and restful) - Understand the concept of index policies and the algorithms to compute such indices - Understand a recent research article [1] and implement the optimization algorithms that are described in the paper. - Implement a numerical simulator that implement the different policies in order to conduct a thorough numerical comparison.

### References

[1] (private communication). “Near-Optimality for Finite-Horizon Bayesian Restless Bandits with Many Arms”. 2019.

[2] J. Gittins, K. Glazebrook, and R. Weber. Multi-armed bandit allocation indices. John Wiley & Sons, 2011.

[3] I. M. Verloop. “Asymptotically optimal priority policies for indexable and nonindexable restless bandits”. In: Ann. Appl. Probab. 26.4 (Aug. 2016), pp. 1947–1995. doi: 10.1214/15-AAP1137. url: https://doi.

org/10.1214/15-AAP1137.

[4] R. R. Weber and G. Weiss. “On an index policy for restless bandits”. In: Journal of Applied Probability 27.3