Multi-armed bandits#

[image: https://miro.medium.com/max/339/0*l7Ra4R_CpJfc-hjz.png]

Notation#

  • Actions: \(\mathcal A = \{a_1, \ldots, a_k\}\) (\(k\) levers)

  • \(A_t \in \mathcal A\) — action selected by the agent at time step \(t\)

  • Action value: \(q(a) = \mathbb E[R_t | A_t = a]\)

  • Optimal value: \(v_* = \max\limits_{a \in \mathcal A} q(a)\)

  • Optimal action: \(a_* = \arg\max\limits_{a \in \mathcal A} q(a)\)

  • Regret: \(\Delta_a = v_* - q(a)\)

  • \(Q_t(a)\) — estimate of \(q(a)\) at time step \(t\)

The goal is to maximize expected cumulative reward

\[ \sum\limits_{i=1}^t q(A_i) = \sum\limits_{i=1}^t \mathbb E[R_i | A_i] \to \max \]

This is equivalent to minimizing the cumulative regret, since \(\sum\limits_{i=1}^t q(A_i) = t v_* - \sum\limits_{i=1}^t \Delta_{A_i}\):

\[ \sum\limits_{i=1}^t \Delta_{A_i} \to \min \]

Sample-average method#

\[ Q_t(a) = \frac 1{N_t(a)}\sum\limits_{i=1}^{t-1} R_i \mathbb I[A_i=a], \]
\[ \quad N_t(a) = \sum\limits_{i=1}^{t-1} \mathbb I[A_i=a] \]

\(N_t(a)\) is the number of times action \(a\) has been taken before step \(t\). If \(N_t(a) = 0\), then \(Q_t(a)\) is set to \(0\) or some other default value.
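The estimate can also be maintained incrementally, \(Q_{n+1} = Q_n + \frac 1n (R_n - Q_n)\), without storing all past rewards. A minimal sketch of this update (the names here are illustrative and not part of the Testbed code used below):

import numpy as np

def update_sample_average(Q, N, action, reward):
    """Incremental form of the sample average: Q_{n+1} = Q_n + (R_n - Q_n) / n."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]

# toy run with k = 2 arms and default value Q_1(a) = 0
Q, N = np.zeros(2), np.zeros(2, dtype=int)
for action, reward in [(0, 1.0), (1, 2.0), (0, 0.0)]:
    update_sample_average(Q, N, action, reward)
print(Q)  # Q is now [0.5, 2.0]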

Q. Suppose \(k=2\) with actions \(a\) and \(b\), and the observed rewards are as in the table below. What are \(Q_5(a)\) and \(Q_5(b)\)?

| \(t\) | 1 | 2 | 3 | 4 | 5 |
|-------|---|---|---|---|---|
| \(a\) | 0 |   |   |   | ? |
| \(b\) |   | 1 | 0 | 0 | ? |

Greedy policy#

Select the action with the highest estimate:

\[ A_t = \arg\max\limits_{a\in \mathcal A} Q_t(a), \text{ or } \pi_t(a) = \mathbb I\big[a = \arg\max\limits_{a'\in \mathcal A} Q_t(a')\big] \]

Q. According to the greedy policy, what is \(A_5\) in the previous example?

\(\varepsilon\)-greedy policy#

  • With probability \(1-\varepsilon\) select the greedy action

\[ A_t = \arg\max\limits_{a\in \mathcal A} Q_t(a) \]
  • With probability \(\varepsilon\) select a random action

Q. If we follow the \(\varepsilon\)-greedy strategy, what is the probability of selecting the greedy action?
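A minimal sketch of \(\varepsilon\)-greedy action selection (illustrative only; the experiments below use the helper train_eps_greedy_agent defined elsewhere):

import numpy as np

def eps_greedy_action(Q, eps, rng):
    # with probability eps explore uniformly, otherwise exploit the current estimates
    if rng.random() < eps:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

rng = np.random.default_rng(0)
Q = np.array([0.2, 0.5, 0.1])
actions = [eps_greedy_action(Q, eps=0.1, rng=rng) for _ in range(1000)]
print(np.mean(np.array(actions) == 1))  # empirical frequency of the greedy action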

Testbed#

A testbed for \(k\)-armed bandits is created in the class Testbed. The reward produced by arm \(i\) is drawn from \(\mathcal N(\mu_i, 1)\), where the means \(\mu_i\) are in turn sampled from the standard normal distribution.
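The Testbed class itself is defined outside this page; the following single-run sketch only illustrates the reward-generation scheme described above (all names are assumptions, and the real class additionally runs many bandits in parallel):

import numpy as np

class ToyTestbed:
    """k arms; pulling arm i yields a reward from N(mu_i, 1) with mu_i ~ N(0, 1)."""
    def __init__(self, k, random_state=None):
        self.rng = np.random.default_rng(random_state)
        self.means = self.rng.normal(size=k)  # true action values q(a)

    def optimal_action(self):
        return int(np.argmax(self.means))

    def step(self, action):
        return self.rng.normal(self.means[action], 1.0)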

tb = Testbed(10, random_state=242)
idx = 11
print("Optimal action:", tb.optimal_action(idx) + 1)
print("Optimal mean reward:", tb.means[idx].max())
tb.plot(idx)
Optimal action: 1
Optimal mean reward: 1.3376758127929265
[figure: reward distributions of the 10 arms for the selected testbed run]
%%time
eps_list = [0, 0.001, 0.01, 0.1, 0.5]
labels = [r"$\varepsilon = {}$".format(eps) for eps in eps_list]
rewards_list, optimal_action_list = mean_rewards_and_optimal_actions(eps_list, n_steps=5000)
CPU times: user 12.9 s, sys: 71 ms, total: 13 s
Wall time: 13.2 s
plot_mean_rewards(rewards_list, labels, 5000)
[figure: mean reward over 5000 steps for each value of \(\varepsilon\)]
plot_mean_optimal_actions(optimal_action_list, labels, 1000)
[figure: share of optimal actions over the first 1000 steps for each value of \(\varepsilon\)]

Q. What is the cumulative regret of the \(\varepsilon\)-greedy policy for a single testbed? Calculate

\[ \lim\limits_{t\to\infty} \frac 1t \sum\limits_{i=1}^t \Delta_{A_i} \text{ and } \lim\limits_{t\to\infty} \frac 1t \sum\limits_{i=1}^t \mathbb P(A_i = a_*) \]

if all actions \(A_i\) are selected according to the \(\varepsilon\)-greedy policy.

Optimistic initial value#

Setting \(Q_1(a)=5\) for all actions encourages exploration during the first steps: every arm initially looks better than it really is, so even a greedy agent tries all of them before the estimates settle.
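Optimistic initialization is typically paired with a constant step size \(\alpha\), as in the first configuration below; a sketch of the corresponding update (names are illustrative):

import numpy as np

k, alpha = 10, 0.1
Q = np.full(k, 5.0)  # optimistic initial estimates Q_1(a) = 5

def constant_alpha_update(Q, action, reward, alpha=alpha):
    # Q <- Q + alpha * (R - Q): the optimistic prior is gradually washed out,
    # so a greedy agent still tries every arm before the estimates settle
    Q[action] += alpha * (reward - Q[action])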

%%time
rewards_opt_0_const, actions_opt_0_const = train_eps_greedy_agent(0.0, 5.0, 0.1, n_steps=2000)
rewards_opt_01, actions_opt_01 = train_eps_greedy_agent(0.1, 5.0, n_steps=2000)
rewards_01_const, actions_01_const = train_eps_greedy_agent(0.1, 0.0, alpha=0.1, n_steps=2000)
CPU times: user 3.04 s, sys: 16.1 ms, total: 3.05 s
Wall time: 3.08 s
labels = [r"$Q_1=5, \varepsilon=0, \alpha=0.1$", r"$Q_1=5, \varepsilon=0.1$", r"$Q_1=0, \varepsilon=0.1, \alpha=0.1$"]
plot_mean_rewards([rewards_opt_0_const, rewards_opt_01, rewards_01_const], labels, 1000)
[figure: mean reward over the first 1000 steps for the three settings]
plot_mean_optimal_actions([actions_opt_0_const, actions_opt_01, actions_01_const], labels, 1000)
[figure: share of optimal actions over the first 1000 steps for the three settings]

Upper Confidence Bound#

\[ A_t = \arg\max\limits_a \bigg\{Q_t(a) + c\sqrt{\frac{\log t}{N_t(a)}}\bigg\} \]

For more information about this mysterious formula, see

  • Sutton & Barto, chapter 2.7

  • Lecture 2 from DeepMind, 53:20 — 1:10:00
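A minimal sketch of the UCB rule (the experiments below rely on Testbed.ucb_action, so the names here are illustrative):

import numpy as np

def ucb_action(Q, N, t, c):
    # try every arm at least once, then pick argmax_a Q(a) + c * sqrt(log t / N(a))
    if np.any(N == 0):
        return int(np.argmin(N))
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))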

def train_ucb_agent(c, n_steps=1000, n_runs=2000, k=10, random_state=242):
    mean_rewards = []
    mean_optimal_actions = []

    # n_runs independent k-armed bandits are played in parallel
    tb = Testbed(k, 0, n_runs, random_state)
    for t in range(n_steps):
        actions = tb.ucb_action(t + 1, c)            # UCB action in every run
        optimals = (actions == tb.optimal_action())  # runs that picked the best arm
        rewards = tb.step(actions)
        mean_rewards.append(rewards.mean())          # average reward across runs at step t
        mean_optimal_actions.append(optimals.mean())

    return mean_rewards, mean_optimal_actions
%%time
rewards_eps, optimals_eps = train_eps_greedy_agent(0.1, n_steps=1000)
rewards_ucb_1, optimals_ucb_1 = train_ucb_agent(1, n_steps=1000)
rewards_ucb_2, optimals_ucb_2 = train_ucb_agent(2, n_steps=1000)
CPU times: user 1.83 s, sys: 20.4 ms, total: 1.85 s
Wall time: 1.88 s
labels = [r"ucb, $c=1$", r"ucb, $c=2$", r"$\varepsilon$-greedy, $\varepsilon=0.1$"]
plot_mean_rewards([rewards_ucb_1, rewards_ucb_2, rewards_eps], labels, 1000)
[figure: mean reward over 1000 steps for UCB (\(c=1, 2\)) and \(\varepsilon\)-greedy]
plot_mean_optimal_actions([optimals_ucb_1, optimals_ucb_2, optimals_eps], labels, 1000)
[figure: share of optimal actions over 1000 steps for UCB (\(c=1, 2\)) and \(\varepsilon\)-greedy]