Project 10.1: Reinforcement Learning

Exercise 13.4

Note that in many MDPs it is possible to find two policies π₁ and π₂ such that π₁ outperforms π₂ if the agent begins in some state s₁, but π₂ outperforms π₁ if it begins in some other state s₂. Put another way, V^{π₁}(s₁) > V^{π₂}(s₁), but V^{π₂}(s₂) > V^{π₁}(s₂).

Explain why there will always exist a single policy that maximizes V^{π}(s) for every initial state s (i.e., an optimal policy π^*). In other words, explain why an MDP always allows a policy π^* such that (∀ π, s) V^{π^*}(s) ≥ V^{π}(s).
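To make the premise of the exercise concrete, here is a minimal Python sketch of a made-up deterministic MDP in which one policy wins from s1 and the other wins from s2. The states, actions, rewards, the discount factor of 0.9, and all names in the code are assumptions chosen for illustration; none of them come from the exercise.

```python
GAMMA = 0.9  # discount factor (assumed value, not from the exercise)

# Hypothetical deterministic MDP: (state, action) -> (next_state, reward).
# "end" is an absorbing state that pays nothing.
MDP = {
    ("s1", "a"): ("end", 1.0),
    ("s1", "b"): ("end", 0.0),
    ("s2", "a"): ("end", 0.0),
    ("s2", "b"): ("end", 1.0),
    ("end", "a"): ("end", 0.0),
    ("end", "b"): ("end", 0.0),
}


def policy_value(policy, start, horizon=200):
    """Discounted return from `start` when following `policy`.

    The infinite sum is truncated after `horizon` steps, which is
    harmless here because every reward after the first step is zero.
    """
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        next_state, reward = MDP[(state, policy[state])]
        total += discount * reward
        discount *= GAMMA
        state = next_state
    return total


# Two policies that each ignore the state they are in.
pi1 = {"s1": "a", "s2": "a", "end": "a"}  # always take action "a"
pi2 = {"s1": "b", "s2": "b", "end": "b"}  # always take action "b"

for s in ("s1", "s2"):
    print(s, policy_value(pi1, s), policy_value(pi2, s))
```

In this toy example pi1 earns 1.0 from s1 where pi2 earns 0.0, and from s2 the comparison flips, so neither of the two policies is better from every start state.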

Isn't this the same as Theorem 13.1, convergence of Q learning for deterministic Markov decision processes? Basically, it says that the learner's estimate, Q_hat, converges toward the true Q function, from which the optimal policy can be read off. As each state-action pair is visited infinitely often and the error of Q_hat is reduced, Q_hat must converge to the Q function of the optimal policy.
Well, I suppose that is why the optimal policy can be found, but not why the optimal policy will always exist.
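One standard line of argument for existence, offered here as a sketch rather than as the textbook's intended answer: with finitely many states and actions and a discount factor below 1, the best achievable return from a state depends only on that state, not on how the agent arrived there. The best action can therefore be chosen separately in every state, and the resulting single policy is at least as good as any other from every start state. The Python sketch below checks this on one small made-up deterministic MDP by brute force. The transition table, rewards, discount factor, and function names are all hypothetical, it only enumerates deterministic policies that depend on the current state alone, and a check on one example is of course not a proof.

```python
from itertools import product

GAMMA = 0.9  # discount factor (assumed value)
STATES = ["s1", "s2", "s3"]
ACTIONS = ["a", "b"]

# Hypothetical deterministic MDP: (state, action) -> (next_state, reward).
STEP = {
    ("s1", "a"): ("s2", 0.0),
    ("s1", "b"): ("s3", 1.0),
    ("s2", "a"): ("s2", 2.0),
    ("s2", "b"): ("s1", 0.0),
    ("s3", "a"): ("s1", 0.0),
    ("s3", "b"): ("s3", 0.5),
}


def backup(v, s, a):
    """One-step lookahead: immediate reward plus discounted value of the successor."""
    next_state, reward = STEP[(s, a)]
    return reward + GAMMA * v[next_state]


def evaluate(policy, sweeps=500):
    """Approximate V^pi for a fixed policy by repeated backups."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: backup(v, s, policy[s]) for s in STATES}
    return v


def value_iteration(sweeps=500):
    """Approximate the optimal value function by repeated max-backups."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: max(backup(v, s, a) for a in ACTIONS) for s in STATES}
    return v


def greedy_policy(v):
    """Pick, in every state separately, the action with the best lookahead."""
    return {s: max(ACTIONS, key=lambda a: backup(v, s, a)) for s in STATES}


v_star = value_iteration()
pi_star = greedy_policy(v_star)
v_pi_star = evaluate(pi_star)

# Enumerate all 2^3 deterministic state-based policies and compare state by state.
all_policies = [dict(zip(STATES, choice))
                for choice in product(ACTIONS, repeat=len(STATES))]
dominates = all(
    v_pi_star[s] >= evaluate(pi)[s] - 1e-9  # small tolerance for float error
    for pi in all_policies
    for s in STATES
)
print(pi_star, dominates)
```

The greedy policy is built one state at a time, yet it is compared against every other policy from every start state at once, which is the property the exercise asks about.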

by: Keith A. Pray
Last Modified: July 4, 2004 9:03 AM