Project 10.1: Reinforcement Learning


Exercise 13.4

Note that in many MDPs it is possible to find two policies π1 and π2 such that π1 outperforms π2 if the agent begins in some state s1, but π2 outperforms π1 if it begins in some other state s2. Put another way, V^π1(s1) > V^π2(s1), but V^π2(s2) > V^π1(s2).


Explain why there will always exist a single policy that maximizes V^π(s) for every single initial state s (i.e., an optimal policy π*). In other words, explain why an MDP always allows a policy π* such that (∀π, s) V^π*(s) >= V^π(s).
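
To make the setup concrete, here is a small hypothetical example (my own, not from the exercise): a two-state deterministic MDP with states A and B, actions L and R, and discount 0.9, in which a policy of always choosing L wins from A while always choosing R wins from B. The dictionary MDP, the helper policy_value, and the policies p1 and p2 are all names I made up for this sketch.

GAMMA = 0.9
# Hypothetical transitions: (state, action) -> (next_state or None if terminal, reward)
MDP = {
    ("A", "L"): (None, 10.0), ("A", "R"): ("B", 0.0),
    ("B", "L"): ("A", 0.0),   ("B", "R"): (None, 10.0),
}

def policy_value(policy, state, steps=50):
    """Evaluate a deterministic policy by rolling it out from `state`."""
    total, discount = 0.0, 1.0
    for _ in range(steps):
        next_state, reward = MDP[(state, policy[state])]
        total += discount * reward
        if next_state is None:
            break
        state, discount = next_state, discount * GAMMA
    return total

p1 = {"A": "L", "B": "L"}   # always L: better when starting in A
p2 = {"A": "R", "B": "R"}   # always R: better when starting in B

for s in ("A", "B"):
    print(s, policy_value(p1, s), policy_value(p2, s))
# Prints A 10.0 9.0 and B 9.0 10.0 -- each policy outperforms the other from one start state.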

Isn't this the same as Theorem 13.1, the convergence of Q learning for deterministic Markov decision processes? Basically, it says that an MDP allows the estimate Q_hat to converge toward the optimal policy: as long as each state-action pair is visited infinitely often, the error in Q_hat keeps shrinking, so Q_hat must converge and the policy it implies is the optimal one.
Well, I suppose that is why the optimal policy can be found, but not why the optimal policy will always exist.
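
As a sanity check on that argument, here is a minimal sketch (again using my made-up two-state MDP, not anything from the book) of the deterministic Q learning update Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a'). Sweeping every state-action pair over and over makes Q_hat settle down, and the greedy policy read off the converged Q_hat does at least as well as either fixed policy from every start state; the names Q_hat, max_q, and greedy are just mine.

GAMMA = 0.9
# Same hypothetical MDP as in the sketch above.
MDP = {
    ("A", "L"): (None, 10.0), ("A", "R"): ("B", 0.0),
    ("B", "L"): ("A", 0.0),   ("B", "R"): (None, 10.0),
}

Q_hat = {sa: 0.0 for sa in MDP}      # start with every estimate at zero

def max_q(state):
    """Best current estimate over the actions available in `state` (0 if terminal)."""
    return 0.0 if state is None else max(Q_hat[(state, a)] for a in ("L", "R"))

# Visit every state-action pair repeatedly, applying the deterministic update.
for _ in range(100):
    for (s, a), (s_next, r) in MDP.items():
        Q_hat[(s, a)] = r + GAMMA * max_q(s_next)

# Greedy policy read off the converged Q_hat, and its value from each start state.
greedy = {s: max(("L", "R"), key=lambda a: Q_hat[(s, a)]) for s in ("A", "B")}
print(greedy)                              # {'A': 'L', 'B': 'R'}
print({s: max_q(s) for s in ("A", "B")})   # 10.0 from both A and B, >= both fixed policies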


by: Keith A. Pray
Last Modified: July 4, 2004 9:03 AM
