Multi-armed bandits#
Notation#
Actions: \(\mathcal A = \{a_1, \ldots, a_k\}\) (\(k\) levers)
\(A_t \in \mathcal A\) — action selected by the agent at time step \(t\)
Action value: \(q(a) = \mathbb E[R_t | A_t = a]\)
Optimal value: \(v_* = \max\limits_{a \in \mathcal A} q(a)\)
Optimal action: \(a_* = \arg\max\limits_{a \in \mathcal A} q(a)\)
Regret: \(\Delta_a = v_* - q(a)\)
\(Q_t(a)\) — estimate of \(q(a)\) at time step \(t\)
The goal is to maximize the expected cumulative reward.
This is the same as minimizing the expected cumulative regret:
\(\mathbb E\Big[\sum\limits_{t=1}^{T} \Delta_{A_t}\Big] = T v_* - \mathbb E\Big[\sum\limits_{t=1}^{T} R_t\Big]\)
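A tiny worked example of these definitions (all numbers are made up for illustration):

# True action values of a made-up 2-armed bandit
q = {"a1": 1.0, "a2": 0.5}
v_star = max(q.values())                      # optimal value v_*
gaps = {a: v_star - q[a] for a in q}          # per-action regrets Delta_a
# A hypothetical sequence of 100 actions in which a2 is pulled 10 times
actions = ["a2"] * 10 + ["a1"] * 90
cumulative_regret = sum(gaps[a] for a in actions)
print(gaps, cumulative_regret)                # {'a1': 0.0, 'a2': 0.5} 5.0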
Sample-average method#
\(Q_t(a)\) is the average of the rewards received on the steps before \(t\) at which action \(a\) was taken: \(Q_t(a) = \dfrac{1}{N_t(a)} \sum\limits_{i < t:\, A_i = a} R_i\), where \(N_t(a)\) is the number of times the action \(a\) was taken before step \(t\). If \(N_t(a) = 0\), then \(Q_t(a)=0\) or some other default value.
Q. Suppose \(k=2\). What are \(Q_5(a)\) and \(Q_5(b)\)?
|  | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| \(a\) | 0 | ? |  |  |  |
| \(b\) | 1 | 0 | 0 | ? |  |
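For a quick check of the sample-average formula, here is a minimal stand-alone sketch (the helper sample_average is made up for illustration and is not part of the notebook's code); the history below lists the known action–reward pairs from the table, with the ? entries still unobserved:

def sample_average(history, action, default=0.0):
    """Average reward of `action` over the (action, reward) pairs seen so far."""
    rewards = [r for a, r in history if a == action]
    return sum(rewards) / len(rewards) if rewards else default

# Known rewards from the table above; the order of the pulls does not matter here
history = [("a", 0), ("b", 1), ("b", 0), ("b", 0)]
print(sample_average(history, "a"), sample_average(history, "b"))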
Greedy policy#
Select the action with the highest estimate:
\(A_t = \arg\max\limits_{a \in \mathcal A} Q_t(a)\)
Q. According to the greedy policy, what is \(A_5\) in the previous example?
\(\varepsilon\)-greedy policy#
With probability \(1-\varepsilon\) select the greedy action
With probability \(\varepsilon\) select an action uniformly at random among all \(k\) actions (see the sketch below)
Q. If we follow the \(\varepsilon\)-greedy strategy, what is the probability of selecting the greedy action?
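A minimal sketch of \(\varepsilon\)-greedy action selection (epsilon_greedy is a hypothetical helper, not the notebook's train_eps_greedy_agent; Q is assumed to be a dict of current estimates):

import random

def epsilon_greedy(Q, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(Q))      # explore: uniform over all actions
    return max(Q, key=Q.get)            # exploit: greedy action (ties broken by dict order)

Q = {"a": 0.2, "b": 0.5}
print([epsilon_greedy(Q, epsilon=0.1) for _ in range(10)])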
Testbed#
A testbed for \(k\)-armed bandits is implemented in the class Testbed. The reward produced by arm \(i\) is drawn from \(\mathcal N(q(a_i), 1)\), where the true action values \(q(a_i)\) are in turn sampled from the standard normal distribution.
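The real Testbed class (used below) also supports many parallel runs and several action-selection helpers; as an illustration of just the reward model described above, a simplified stand-alone version might look like this (ToyTestbed is made up; only the attribute means and the methods step and optimal_action mirror names used later):

import numpy as np

class ToyTestbed:
    """Simplified sketch: arm i has a true mean drawn from N(0, 1); pulling it returns N(mean_i, 1)."""

    def __init__(self, k=10, random_state=None):
        self.rng = np.random.default_rng(random_state)
        self.means = self.rng.normal(size=k)    # true action values

    def optimal_action(self):
        return int(np.argmax(self.means))

    def step(self, action):
        return self.rng.normal(self.means[action], 1.0)

tb_toy = ToyTestbed(10, random_state=0)
print(tb_toy.optimal_action(), tb_toy.step(0))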
tb = Testbed(10, random_state=242)
idx = 11
print("Optimal action:", tb.optimal_action(idx) + 1)
print("Optimal mean reward:", tb.means[idx].max())
tb.plot(idx)
Optimal action: 1
Optimal mean reward: 1.3376758127929265
%%time
eps_list = [0, 0.001, 0.01, 0.1, 0.5]
labels = [r"$\varepsilon = {}$".format(eps) for eps in eps_list]
rewards_list, optimal_action_list = mean_rewards_and_optimal_actions(eps_list, n_steps=5000)
CPU times: user 12.9 s, sys: 71 ms, total: 13 s
Wall time: 13.2 s
plot_mean_rewards(rewards_list, labels, 5000)
plot_mean_optimal_actions(optimal_action_list, labels, 1000)
Q. What is the cumulative regret of the \(\varepsilon\)-greedy policy for a single testbed? Calculate
\(\mathbb E\Big[\sum\limits_{i=1}^{T} \Delta_{A_i}\Big]\)
if all the actions \(A_i\) are selected according to the \(\varepsilon\)-greedy policy.
Optimistic initial value#
Setting \(Q_1(a)=5\) for all actions encourages exploration in the first steps: the rewards are much smaller than the initial estimates, so the agent is "disappointed" by every arm it tries and keeps switching until the estimates settle.
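The labels below also mention a constant step size \(\alpha\); in the standard formulation this means the estimate of the selected action is updated as \(Q_{t+1}(A_t) = Q_t(A_t) + \alpha\,(R_t - Q_t(A_t))\). A toy sketch of this update with optimistic initial values (not the notebook's train_eps_greedy_agent; the 3-armed bandit at the end is made up):

import random

def run_optimistic_greedy(true_means, q_init=5.0, alpha=0.1, n_steps=2000, seed=0):
    """Greedy agent with optimistic initial estimates and a constant step size."""
    rng = random.Random(seed)
    Q = [q_init] * len(true_means)                  # optimistic initial estimates Q_1(a)
    rewards = []
    for _ in range(n_steps):
        a = max(range(len(Q)), key=Q.__getitem__)   # greedy action (epsilon = 0)
        r = rng.gauss(true_means[a], 1.0)           # reward ~ N(q(a), 1)
        Q[a] += alpha * (r - Q[a])                  # constant step-size update
        rewards.append(r)
    return rewards

rewards = run_optimistic_greedy([0.2, 1.0, -0.5])
print(sum(rewards[-100:]) / 100)                    # average reward near the end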
%%time
rewards_opt_0_const, actions_opt_0_const = train_eps_greedy_agent(0.0, 5.0, 0.1, n_steps=2000)
rewards_opt_01, actions_opt_01 = train_eps_greedy_agent(0.1, 5.0, n_steps=2000)
rewards_01_const, actions_01_const = train_eps_greedy_agent(0.1, 0.0, alpha=0.1, n_steps=2000)
CPU times: user 3.04 s, sys: 16.1 ms, total: 3.05 s
Wall time: 3.08 s
labels = [r"$Q_1=5, \varepsilon=0, \alpha=0.1$", r"$Q_1=5, \varepsilon=0.1$", r"$Q_1=0, \varepsilon=0.1, \alpha=0.1$"]
plot_mean_rewards([rewards_opt_0_const, rewards_opt_01, rewards_01_const], labels, 1000)
plot_mean_optimal_actions([actions_opt_0_const, actions_opt_01, actions_01_const], labels, 1000)
Upper Confidence Bound#
The UCB rule selects
\(A_t = \arg\max\limits_{a \in \mathcal A} \Big[\, Q_t(a) + c \sqrt{\dfrac{\ln t}{N_t(a)}} \,\Big]\)
For more information about this mysterious formula see
Sutton & Barto, chapter 2.7
Lecture 2 from DeepMind, 53:20 — 1:10:00
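The notebook's Testbed.ucb_action (used below) encapsulates this computation; a stand-alone sketch of the same score (ucb_action here is a made-up NumPy helper, not the Testbed method) might look like:

import numpy as np

def ucb_action(Q, N, t, c):
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]; untried arms get an infinite bonus."""
    Q, N = np.asarray(Q, dtype=float), np.asarray(N, dtype=float)
    bonus = np.where(N > 0, c * np.sqrt(np.log(t) / np.maximum(N, 1)), np.inf)
    return int(np.argmax(Q + bonus))

print(ucb_action(Q=[0.1, 0.5, 0.3], N=[2, 5, 0], t=8, c=2.0))   # arm 2: never tried, so it wins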
def train_ucb_agent(c, n_steps=1000, n_runs=2000, k=10, random_state=242):
    mean_rewards = []
    mean_optimal_actions = []
    tb = Testbed(k, 0, n_runs, random_state)
    for t in range(n_steps):
        actions = tb.ucb_action(t + 1, c)       # UCB uses the 1-based time step
        optimals = (actions == tb.optimal_action())
        rewards = tb.step(actions)
        mean_rewards.append(rewards.mean())     # average over the parallel runs
        mean_optimal_actions.append(optimals.mean())
    return mean_rewards, mean_optimal_actions
%%time
rewards_eps, optimals_eps = train_eps_greedy_agent(0.1, n_steps=1000)
rewards_ucb_1, optimals_ucb_1 = train_ucb_agent(1, n_steps=1000)
rewards_ucb_2, optimals_ucb_2 = train_ucb_agent(2, n_steps=1000)
CPU times: user 1.83 s, sys: 20.4 ms, total: 1.85 s
Wall time: 1.88 s
labels = [r"ucb, $c=1$", r"ucb, $c=2$", r"$\varepsilon$-greedy, $\varepsilon=0.1$"]
plot_mean_rewards([rewards_ucb_1, rewards_ucb_2, rewards_eps], labels, 1000)
plot_mean_optimal_actions([optimals_ucb_1, optimals_ucb_2, optimals_eps], labels, 1000)