Markov Decision Process (MDP)
MDP-1
- $\mathcal S$ is a set of all possible states
- $\mathcal A(s)$ — set of actions (can depend on the state $s$)
- $p(s', r \mid s, a)$ — joint probability distribution of reward $r$ and next state $s'$ given state $s$ and action $a$
- $\gamma \in [0, 1]$ — discount factor
MDP-2
- $\mathcal S$ is a set of all possible states
- $\mathcal A$ — set of actions
- $p(s' \mid s, a)$ is the probability of transitioning to $s'$, given a state $s$ and action $a$
- expected reward: $r(s, a) = \mathbb E[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma \in [0, 1]$ — discount factor
Question
How to obtain MDP-2 from MDP-1?
Answer
Marginalize the joint distribution over rewards and take the expected reward:
$$p(s' \mid s, a) = \sum_r p(s', r \mid s, a), \qquad r(s, a) = \sum_{s', r} r \, p(s', r \mid s, a).$$
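The conversion is easy to carry out numerically. Below is a minimal NumPy sketch, assuming the joint model is stored as a table over a finite set of reward values; the array names and shapes are illustrative and not part of the handbook's code.

import numpy as np

# Hypothetical joint model: p_joint[s, a, s_next, r_idx] = p(s', r | s, a)
n_states, n_actions = 3, 2
rewards = np.array([0.0, 1.0])                      # finite set of possible reward values
rng = np.random.default_rng(0)
p_joint = rng.random((n_states, n_actions, n_states, len(rewards)))
p_joint /= p_joint.sum(axis=(2, 3), keepdims=True)  # normalize over (s', r)

# MDP-2 transitions: p(s' | s, a) = sum_r p(s', r | s, a)
p_transition = p_joint.sum(axis=3)
# MDP-2 expected reward: r(s, a) = sum_{s', r} r * p(s', r | s, a)
r_expected = (p_joint * rewards).sum(axis=(2, 3))

print(p_transition.sum(axis=2))  # each row sums to one
print(r_expected.shape)          # (n_states, n_actions)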
Value function
Value (expected return) of following policy $\pi$ starting from state $s$:
$$v_\pi(s) = \mathbb E_\pi[G_t \mid S_t = s], \qquad G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$$
Bellman expectation equality for $v_\pi$:
MDP-1:
$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big(r + \gamma v_\pi(s')\big)$$
MDP-2:
$$v_\pi(s) = \sum_a \pi(a \mid s) \Big(r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')\Big)$$
Note that we need to fix some policy $\pi$ for these equalities to hold.
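For a finite MDP with a fixed policy, the Bellman expectation equality is a linear system in $v_\pi$ and can be solved exactly. Presumably this is what grid.bellman_solution() does below; here is a standalone sketch under the MDP-2 parameterization (the function name and array layout are my assumptions, not the handbook's API).

import numpy as np

def bellman_expectation_solve(p_transition, r_expected, policy, gamma):
    """Solve v = r_pi + gamma * P_pi @ v for a fixed policy.

    p_transition: (S, A, S) array of p(s' | s, a)
    r_expected:   (S, A) array of r(s, a)
    policy:       (S, A) array of pi(a | s)
    """
    p_pi = np.einsum('sa,sat->st', policy, p_transition)  # policy-averaged transitions
    r_pi = (policy * r_expected).sum(axis=1)              # policy-averaged rewards
    n = p_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * p_pi, r_pi)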
Exercise
Check that the Bellman expectation equality holds for the center state of the gridworld.

grid = GridWorld(5, 5, [(1, 21, 10), (3, 13, 5)])
values = grid.bellman_solution()
sns.heatmap(values, cmap='RdBu', annot=True, fmt=".3f");
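One way to do the check numerically, assuming the usual dynamics of this gridworld (each of the four moves succeeds deterministically, the reward is zero except in the two teleporting states), the equiprobable random policy, and $\gamma = 0.9$; these assumptions are mine and are not guaranteed by the GridWorld class:

import numpy as np

gamma = 0.9
v = np.asarray(values).reshape(5, 5)
r, c = 2, 2                                   # center state of the 5x5 grid
neighbors = [v[r - 1, c], v[r + 1, c], v[r, c - 1], v[r, c + 1]]
# Under a random policy with deterministic moves and zero reward in the center,
# the Bellman expectation equality reads v(center) = 0.25 * sum_a (0 + gamma * v(neighbor_a))
lhs = v[r, c]
rhs = 0.25 * sum(gamma * nb for nb in neighbors)
print(lhs, rhs)                               # the two numbers should coincide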
Action-value function
Value (expected return) of taking action $a$ in state $s$ and then following policy $\pi$:
$$q_\pi(s, a) = \mathbb E_\pi[G_t \mid S_t = s, A_t = a].$$
Question
How to write $v_\pi(s)$ in terms of $q_\pi(s, a)$?
Answer
Average over the actions prescribed by the policy:
$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a).$$
Bellman expectation equality for $q_\pi$:
MDP-1:
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \big(r + \gamma v_\pi(s')\big).$$
Since $v_\pi(s') = \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')$, this can be rewritten as
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \Big(r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Big).$$
MDP-2:
$$q_\pi(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a').$$
Once again, the equation holds for some fixed policy $\pi$.
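In the MDP-2 parameterization, $q_\pi$ is obtained from $v_\pi$ by one expectation over the next state. A minimal NumPy sketch, with the same illustrative array layout as above (not the handbook's API):

import numpy as np

def q_from_v(p_transition, r_expected, v, gamma):
    """q(s, a) = r(s, a) + gamma * sum_{s'} p(s' | s, a) * v(s')."""
    return r_expected + gamma * np.einsum('sat,t->sa', p_transition, v)

Combining it with $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$ recovers the Bellman expectation equality for $q_\pi$.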
Optimal policy
Optimal policy $\pi_*$ is a policy whose value is the largest possible in every state: $v_{\pi_*}(s) \geqslant v_\pi(s)$ for all states $s$ and all policies $\pi$. The optimal value functions are
$$v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a).$$
Note that both optimal value functions are related: $v_*(s) = \max_a q_*(s, a)$.
If we now know the optimal action-value function $q_*$, the optimal policy acts greedily with respect to it:
$$\pi_*(s) = \arg\max_a q_*(s, a).$$
Bellman optimality equality for $v_*$:
MDP-1:
$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \big(r + \gamma v_*(s')\big)$$
MDP-2:
$$v_*(s) = \max_a \Big(r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_*(s')\Big)$$
Bellman optimality equality for $q_*$:
MDP-1:
$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \big(r + \gamma \max_{a'} q_*(s', a')\big)$$
MDP-2:
$$q_*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} q_*(s', a')$$
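For clarity, here is the standard one-line bridge between the two pairs of equations (this derivation is my addition): since $v_*(s) = \max_a q_*(s, a)$ and $q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\big(r + \gamma v_*(s')\big)$, substituting the first identity into the second yields the optimality equation for $q_*$, and taking $\max_a$ of the second yields the optimality equation for $v_*$:
$$v_*(s) = \max_a q_*(s, a) = \max_a \sum_{s', r} p(s', r \mid s, a)\big(r + \gamma v_*(s')\big).$$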

Exercise
Compute the maximal value of $v_*$ over the gridworld states.
10 / (1 - 0.9**5)
24.419428096993972
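A short justification of this number (the reasoning below is my addition and assumes $\gamma = 0.9$ and the standard dynamics of this gridworld: every action in the best state yields reward $10$ and teleports the agent to a cell five steps away, and the optimal policy walks straight back, collecting no reward on the way):
$$v_* = 10 + \gamma^5 \cdot 10 + \gamma^{10} \cdot 10 + \ldots = \frac{10}{1 - \gamma^5} \approx 24.419.$$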
Value iteration
Initialize $v_0(s) = 0$ for all $s \in \mathcal S$. While $\max\limits_{s} |v_{k+1}(s) - v_k(s)| > \varepsilon$, update
$$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \big(r + \gamma v_k(s')\big), \quad s \in \mathcal S.$$
After convergence take the greedy policy
$$\pi(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \big(r + \gamma v(s')\big).$$
Q. How to update the value function in MDP-2?
Apply the value iteration method to the gridworld environment:
grid = GridWorld(5, 5, [(1, 21, 10), (3, 13, 5)])
for i in range(1000):
    diff = grid.update_state_values()
    print(f"diff at iteration {i}: {diff:.6f}")
    if diff < 1e-3:
        break
diff at iteration 0: 11.180340
diff at iteration 1: 16.837458
diff at iteration 2: 15.153712
diff at iteration 3: 14.390083
diff at iteration 4: 13.201365
diff at iteration 5: 11.359212
diff at iteration 6: 10.087099
diff at iteration 7: 8.899113
diff at iteration 8: 8.128209
diff at iteration 9: 7.426842
diff at iteration 10: 6.520376
diff at iteration 11: 5.832493
diff at iteration 12: 5.165771
diff at iteration 13: 4.786819
diff at iteration 14: 4.422070
diff at iteration 15: 3.967694
diff at iteration 16: 3.580832
diff at iteration 17: 3.102996
diff at iteration 18: 2.886449
diff at iteration 19: 2.703556
diff at iteration 20: 2.447296
diff at iteration 21: 2.182663
diff at iteration 22: 1.821722
diff at iteration 23: 1.697455
diff at iteration 24: 1.627680
diff at iteration 25: 1.530449
diff at iteration 26: 1.368849
diff at iteration 27: 1.087588
diff at iteration 28: 1.036305
diff at iteration 29: 1.036065
diff at iteration 30: 1.016825
diff at iteration 31: 0.915050
diff at iteration 32: 0.679617
diff at iteration 33: 0.618063
diff at iteration 34: 0.621914
diff at iteration 35: 0.613145
diff at iteration 36: 0.551831
diff at iteration 37: 0.405511
diff at iteration 38: 0.364960
diff at iteration 39: 0.367234
diff at iteration 40: 0.362056
diff at iteration 41: 0.325851
diff at iteration 42: 0.239450
diff at iteration 43: 0.215505
diff at iteration 44: 0.216848
diff at iteration 45: 0.213791
diff at iteration 46: 0.192412
diff at iteration 47: 0.141393
diff at iteration 48: 0.127254
diff at iteration 49: 0.128047
diff at iteration 50: 0.126241
diff at iteration 51: 0.113617
diff at iteration 52: 0.083491
diff at iteration 53: 0.075142
diff at iteration 54: 0.075610
diff at iteration 55: 0.074544
diff at iteration 56: 0.067090
diff at iteration 57: 0.049301
diff at iteration 58: 0.044371
diff at iteration 59: 0.044647
diff at iteration 60: 0.044018
diff at iteration 61: 0.039616
diff at iteration 62: 0.029112
diff at iteration 63: 0.026200
diff at iteration 64: 0.026364
diff at iteration 65: 0.025992
diff at iteration 66: 0.023393
diff at iteration 67: 0.017190
diff at iteration 68: 0.015471
diff at iteration 69: 0.015567
diff at iteration 70: 0.015348
diff at iteration 71: 0.013813
diff at iteration 72: 0.010151
diff at iteration 73: 0.009136
diff at iteration 74: 0.009192
diff at iteration 75: 0.009063
diff at iteration 76: 0.008157
diff at iteration 77: 0.005994
diff at iteration 78: 0.005394
diff at iteration 79: 0.005428
diff at iteration 80: 0.005352
diff at iteration 81: 0.004816
diff at iteration 82: 0.003539
diff at iteration 83: 0.003185
diff at iteration 84: 0.003205
diff at iteration 85: 0.003160
diff at iteration 86: 0.002844
diff at iteration 87: 0.002090
diff at iteration 88: 0.001881
diff at iteration 89: 0.001893
diff at iteration 90: 0.001866
diff at iteration 91: 0.001679
diff at iteration 92: 0.001234
diff at iteration 93: 0.001111
diff at iteration 94: 0.001118
diff at iteration 95: 0.001102
diff at iteration 96: 0.000992
sns.heatmap(grid.values.reshape(5,5), cmap='RdBu', annot=True, fmt=".3f");
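For reference, here is what a single value-iteration sweep looks like in the MDP-2 parameterization; this is a standalone sketch with illustrative names, and grid.update_state_values() presumably performs something similar while also returning the size of the change, though its source is not shown here.

import numpy as np

def value_iteration_sweep(p_transition, r_expected, v, gamma):
    """One sweep: v(s) <- max_a [r(s, a) + gamma * sum_{s'} p(s' | s, a) v(s')].

    Returns the updated values and the largest absolute change."""
    q = r_expected + gamma * np.einsum('sat,t->sa', p_transition, v)
    v_new = q.max(axis=1)
    return v_new, np.abs(v_new - v).max()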
Policy iteration
Initialize some policy $\pi_0$ (e.g., uniformly random). Then repeat for $k = 0, 1, 2, \ldots$:
- Compute the value function $v_{\pi_k}$ by solving the Bellman expectation linear system.
- Compute the action-value function $q_{\pi_k}(s, a) = \sum_{s', r} p(s', r \mid s, a)\big(r + \gamma v_{\pi_k}(s')\big)$.
- Compute the new policy $\pi_{k+1}(s) = \arg\max_a q_{\pi_k}(s, a)$.
Q. Rewrite the update of $q_{\pi_k}$ in terms of MDP-2.
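To make the loop concrete, here is a compact standalone sketch of policy iteration under the MDP-2 parameterization (function and array names are illustrative assumptions, not the handbook's GridWorld API):

import numpy as np

def policy_iteration(p_transition, r_expected, gamma, n_iter=100):
    """Policy iteration for a finite MDP given p(s' | s, a) and r(s, a)."""
    n_states, n_actions = r_expected.shape
    policy = np.full((n_states, n_actions), 1.0 / n_actions)  # start from the uniform policy
    for _ in range(n_iter):
        # Policy evaluation: solve v = r_pi + gamma * P_pi v exactly
        p_pi = np.einsum('sa,sat->st', policy, p_transition)
        r_pi = (policy * r_expected).sum(axis=1)
        v = np.linalg.solve(np.eye(n_states) - gamma * p_pi, r_pi)
        # Policy improvement: act greedily with respect to q_pi
        q = r_expected + gamma * np.einsum('sat,t->sa', p_transition, v)
        new_policy = np.zeros_like(policy)
        new_policy[np.arange(n_states), q.argmax(axis=1)] = 1.0
        if np.allclose(new_policy, policy):
            break
        policy = new_policy
    return policy, v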