Multi-armed bandits#

https://miro.medium.com/max/339/0*l7Ra4R_CpJfc-hjz.png

Notation#

  • Actions: \(\mathcal A = \{a_1, \ldots, a_k\}\) (\(k\) levers)

  • \(A_t \in \mathcal A\) — action selected by the agent at time step \(t\)

  • Action value: \(q(a) = \mathbb E[R_t | A_t = a]\)

  • Optimal value: \(v_* = \max\limits_{a \in \mathcal A} q(a)\)

  • Optimal action: \(a_* = \arg\max\limits_{a \in \mathcal A} q(a)\)

  • Regret: \(\Delta_a = v_* - q(a)\)

  • \(Q_t(a)\) — estimate of \(q(a)\) at time step \(t\)

The goal is to maximize the expected cumulative reward

\[ \sum\limits_{i=1}^t q(A_i) = \sum\limits_{i=1}^t \mathbb E[R_i \mid A_i] \to \max \]

This is equivalent to minimizing the cumulative regret:

\[ \sum\limits_{i=1}^t \Delta_{A_i} \to \min \]

Sample-average method#

\[ Q_t(a) = \frac 1{N_t(a)}\sum\limits_{i=1}^{t-1} R_i \mathbb I[A_i=a], \]
\[ \quad N_t(a) = \sum\limits_{i=1}^{t-1} \mathbb I[A_i=a] \]

\(N_t(a)\) is the number of times action \(a\) has been taken before step \(t\). If \(N_t(a) = 0\), then \(Q_t(a)\) is set to \(0\) or some other default value.
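The sample average can be maintained incrementally, without storing past rewards. Below is a minimal sketch of such an estimator (the class SampleAverage and its methods are illustrative, not part of the course code):

import numpy as np

class SampleAverage:
    """Illustrative sample-average estimator; not the course implementation."""

    def __init__(self, k, default=0.0):
        self.q = np.full(k, default)      # Q_t(a); default value while N_t(a) = 0
        self.n = np.zeros(k, dtype=int)   # N_t(a)

    def update(self, a, reward):
        # incremental form of the sample average: Q_{n+1} = Q_n + (R_n - Q_n) / n
        self.n[a] += 1
        self.q[a] += (reward - self.q[a]) / self.n[a]

est = SampleAverage(k=2)
est.update(0, 1.0)
est.update(0, 0.0)
print(est.q)  # Q estimate 0.5 for arm 0, default 0.0 for the untried arm 1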

Q. Suppose \(k=2\). What are \(Q_5(a)\) and \(Q_5(b)\)?

| \(t\) | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| \(a\) |   | 0 |   |   | ? |
| \(b\) | 1 |   | 0 | 0 | ? |

Greedy policy#

Select the action with the highest estimated value:

\[ A_t = \arg\max\limits_{a\in \mathcal A} Q_t(a), \quad\text{or equivalently}\quad \pi_t(a) = \mathbb I\Big[a = \arg\max\limits_{a'\in \mathcal A} Q_t(a')\Big] \]

Q. According to greedy policy, what is \(A_5\) in the previous example?

\(\varepsilon\)-greedy policy#

  • With probability \(1-\varepsilon\) select the greedy action

\[ A_t = \arg\max\limits_{a\in \mathcal A} Q_t(a) \]
  • With probability \(\varepsilon\) select an action uniformly at random

Q. If we follow the \(\varepsilon\)-greedy strategy, what is the probability of selecting the greedy action?
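As a quick illustration, here is a minimal sketch of \(\varepsilon\)-greedy action selection on top of the current estimates (the function eps_greedy_action is hypothetical; the experiments below rely on the course helpers such as train_eps_greedy_agent):

import numpy as np

rng = np.random.default_rng(0)

def eps_greedy_action(q_estimates, eps):
    """With probability eps pick a uniformly random arm, otherwise the greedy one."""
    k = len(q_estimates)
    if rng.random() < eps:
        return int(rng.integers(k))         # explore: any of the k arms
    return int(np.argmax(q_estimates))      # exploit: ties go to the first maximal arm

print(eps_greedy_action(np.array([0.0, 0.5, 0.2]), eps=0.1))

Note that the random branch can also pick the currently greedy arm, which is relevant for the question above.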

Testbed#

A testbed for \(k\)-armed bandits is implemented in the class Testbed. The reward produced by arm \(a_i\) is drawn from \(\mathcal N(q(a_i), 1)\), where the true action values \(q(a_i)\) are in turn sampled from the standard normal distribution.
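This sampling scheme can be summarized in a few lines (a sketch of the reward model only, not of the actual Testbed class):

import numpy as np

rng = np.random.default_rng(242)

k = 10
true_means = rng.standard_normal(k)   # q(a_i) ~ N(0, 1), fixed once per bandit problem

def pull(arm):
    # observed reward: the true mean of the arm plus unit-variance Gaussian noise
    return true_means[arm] + rng.standard_normal()

print("optimal arm:", true_means.argmax(), "sample reward:", pull(true_means.argmax()))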

tb = Testbed(10, random_state=242)
idx = 11
print("Optimal action:", tb.optimal_action(idx) + 1)
print("Optimal mean reward:", tb.means[idx].max())
tb.plot(idx)
Optimal action: 1
Optimal mean reward: 1.3376758127929265
../_images/7b0ce620d5fa8999aeef4f17680f9bd9339694b721d89c24137e880751253e35.svg
%%time
eps_list = [0, 0.001, 0.01, 0.1, 0.5]
labels = [r"$\varepsilon = {}$".format(eps) for eps in eps_list]
rewards_list, optimal_action_list = mean_rewards_and_optimal_actions(eps_list, n_steps=5000)
CPU times: user 13.1 s, sys: 99.2 ms, total: 13.2 s
Wall time: 13.8 s
plot_mean_rewards(rewards_list, labels, 5000)
../_images/3c9ae6635241a8d4eb7a2e6d06d090a6f1166bcc501445e3df294b8120471a02.svg
plot_mean_optimal_actions(optimal_action_list, labels, 1000)
../_images/65f648193e77a6de5281890a1e61b4586138183e9b4d3b6c961d93631de3761d.svg

Q. What is the cumulative regret of the \(\varepsilon\)-greedy policy for a single testbed? Calculate

\[ \lim\limits_{t\to\infty} \frac 1t \sum\limits_{i=1}^t \Delta_{A_i} \text{ and } \lim\limits_{t\to\infty} \frac 1t \sum\limits_{i=1}^t \mathbb P(A_i = a_*) \]

if all the actions \(A_i\) are selected according to \(\varepsilon\)-greedy policy.

Optimistic initial value#

Setting \(Q_1(a)=5\) for all actions encourages exploration during the first steps: the observed rewards are typically far below the optimistic estimates, so the greedy agent is disappointed by every arm it tries and keeps switching to the untried ones until the optimism wears off. Some of the experiments below also use a constant step size \(\alpha\) instead of the sample average: \(Q_{t+1}(A_t) = Q_t(A_t) + \alpha\big(R_t - Q_t(A_t)\big)\).
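A minimal sketch of a purely greedy agent with optimistic initial estimates and a constant step size (illustrative only; the actual curves below come from train_eps_greedy_agent):

import numpy as np

rng = np.random.default_rng(0)
k, alpha, q_init = 10, 0.1, 5.0
true_means = rng.standard_normal(k)

q = np.full(k, q_init)           # optimistic: every arm initially looks better than it is
for t in range(200):
    a = int(np.argmax(q))        # purely greedy, eps = 0
    r = true_means[a] + rng.standard_normal()
    q[a] += alpha * (r - q[a])   # constant step size gradually removes the optimism

print("greedy arm:", int(np.argmax(q)), "optimal arm:", int(true_means.argmax()))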

%%time
rewards_opt_0_const, actions_opt_0_const = train_eps_greedy_agent(0.0, 5.0, 0.1, n_steps=2000)
rewards_opt_01, actions_opt_01 = train_eps_greedy_agent(0.1, 5.0, n_steps=2000)
rewards_01_const, actions_01_const = train_eps_greedy_agent(0.1, 0.0, alpha=0.1, n_steps=2000)
CPU times: user 3.13 s, sys: 24.5 ms, total: 3.15 s
Wall time: 3.4 s
labels = [r"$Q_1=5, \varepsilon=0, \alpha=0.1$", r"$Q_1=5, \varepsilon=0.1$", r"$Q_1=0, \varepsilon=0.1, \alpha=0.1$"]
plot_mean_rewards([rewards_opt_0_const, rewards_opt_01, rewards_01_const], labels, 1000)
../_images/c84139c556180a56b0850e8785e3aee92d7598ce14ad16908a95f58f00ba7732.svg
plot_mean_optimal_actions([actions_opt_0_const, actions_opt_01, actions_01_const], labels, 1000)
../_images/b0c29eab406f75fe2bd969346e534b0657388afb3c72ea9025eaca54c1343550.svg

Upper Confidence Bound#

\[ A_t = \arg\max\limits_a \bigg\{Q_t(a) + c\sqrt{\frac{\log t}{N_t(a)}}\bigg\} \]

For more information about this mysterious formula see

  • Sutton & Barto, chapter 2.7

  • Lecture 2 from DeepMind, 53:20 — 1:10:00
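For a single bandit the rule can be implemented directly as follows (a sketch with hypothetical names; the code below uses the vectorized tb.ucb_action instead):

import numpy as np

def ucb_action(q, n, t, c):
    """q: estimates Q_t(a); n: pull counts N_t(a); t: current step, starting from 1."""
    untried = np.flatnonzero(n == 0)
    if untried.size > 0:
        return int(untried[0])              # try every arm at least once first
    bonus = c * np.sqrt(np.log(t) / n)      # exploration bonus shrinks as N_t(a) grows
    return int(np.argmax(q + bonus))

# example: an arm with a slightly lower estimate but far fewer pulls still wins
print(ucb_action(q=np.array([0.4, 0.5]), n=np.array([2, 50]), t=52, c=2.0))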

def train_ucb_agent(c, n_steps=1000, n_runs=2000, k=10, random_state=242):
    mean_rewards = []
    mean_optimal_actions = []

    # one Testbed object simulates n_runs independent k-armed bandit problems
    tb = Testbed(k, 0, n_runs, random_state)
    for t in range(n_steps):
        # UCB action for every run; t + 1 so that the bonus uses t = 1, 2, ...
        actions = tb.ucb_action(t + 1, c)
        optimals = (actions == tb.optimal_action())
        rewards = tb.step(actions)
        # average over runs: reward at step t and share of optimal choices
        mean_rewards.append(rewards.mean())
        mean_optimal_actions.append(optimals.mean())

    return mean_rewards, mean_optimal_actions
%%time
rewards_eps, optimals_eps = train_eps_greedy_agent(0.1, n_steps=1000)
rewards_ucb_1, optimals_ucb_1 = train_ucb_agent(1, n_steps=1000)
rewards_ucb_2, optimals_ucb_2 = train_ucb_agent(2, n_steps=1000)
CPU times: user 1.75 s, sys: 21 ms, total: 1.77 s
Wall time: 1.82 s
labels = [r"ucb, $c=1$", r"ucb, $c=2$", r"$\varepsilon$-greedy, $\varepsilon=0.1$"]
plot_mean_rewards([rewards_ucb_1, rewards_ucb_2, rewards_eps], labels, 1000)
../_images/a30055a3304c374208a5f75d791e0a0476e9d4e42d927554ebad53c5b8550d95.svg
plot_mean_optimal_actions([optimals_ucb_1, optimals_ucb_2, optimals_eps], labels, 1000)
../_images/cc5e76f6381e196efe2528ab1563702986658cca47cadb435aec30bc738a7614.svg