Keith A. Pray - Professional and Academic Site
Project 10.1: Reinforcement Learning
Exercise 13.4
Explain why there will always exist a single policy that maximizes V^π(s) for every single initial state s (i.e., an optimal policy π*). In other words, explain why an MDP always allows a policy π* such that (∀π, s) V^{π*}(s) ≥ V^π(s).

Isn't this the same as Theorem 13.1, convergence of Q learning for deterministic Markov decision processes? Basically it says that for such an MDP the estimate Q_hat converges toward the true Q function, and therefore toward the optimal policy. As each state-action pair is visited infinitely often, the error in Q_hat is reduced, so it must converge to Q, and the policy that is greedy with respect to it becomes optimal. Well, I suppose that is why the optimal policy can be found, but not why the optimal policy will always exist.
by: Keith A. Pray Last Modified: July 4, 2004 9:03 AM