Keith A. Pray - Professional and Academic Site

Project 10.1: Reinforcement Learning

Exercise 13.4

Note that in many MDPs it is possible to find two policies $\pi_1$ and $\pi_2$ such that $\pi_1$ outperforms $\pi_2$ if the agent begins in some state $s_1$, but $\pi_2$ outperforms $\pi_1$ if it begins in some other state $s_2$. Put another way, $V^{\pi_1}(s_1) > V^{\pi_2}(s_1)$, but $V^{\pi_2}(s_2) > V^{\pi_1}(s_2)$.
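To make the situation concrete, here is a minimal sketch of a toy deterministic MDP in which each of two policies wins from a different start state. The MDP, the state and action names, and the discount factor 0.9 are all made up for illustration; none of them come from the exercise.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical deterministic MDP: (state, action) -> (reward, next_state).
# "end" is an absorbing state with zero reward.
MDP = {
    ("s1", "a"): (10.0, "end"), ("s1", "b"): (0.0, "end"),
    ("s2", "a"): (0.0, "end"),  ("s2", "b"): (10.0, "end"),
    ("end", "a"): (0.0, "end"), ("end", "b"): (0.0, "end"),
}
STATES = ["s1", "s2", "end"]


def evaluate(policy, sweeps=100):
    """Iteratively compute V^policy(s) = r + GAMMA * V^policy(next state)."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        new_v = {}
        for s in STATES:
            reward, nxt = MDP[(s, policy[s])]
            new_v[s] = reward + GAMMA * v[nxt]
        v = new_v
    return v


pi_1 = {s: "a" for s in STATES}  # always choose action "a"
pi_2 = {s: "b" for s in STATES}  # always choose action "b"
v1, v2 = evaluate(pi_1), evaluate(pi_2)

print("pi_1 better from s1:", v1["s1"] > v2["s1"])
print("pi_2 better from s2:", v2["s2"] > v1["s2"])
```

Neither policy dominates the other here, yet the policy that picks "a" in s1 and "b" in s2 is at least as good as both from every state.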

Explain why there will always exist a single policy that maximizes $V^{\pi}(s)$ for every initial state $s$ (i.e., an optimal policy $\pi^*$). In other words, explain why an MDP always allows a policy $\pi^*$ such that $(\forall \pi, s)\ V^{\pi^*}(s) \geq V^{\pi}(s)$.

Isn't this the same as Theorem 13.1 (Convergence of Q learning for deterministic Markov decision processes)? Basically it says that in a deterministic MDP the learner's estimate Q_hat converges to the true Q function, from which the optimal policy can be read off. As each state-action pair is visited infinitely often and the error of Q_hat is reduced, the policy derived from Q_hat must converge to the optimal policy.
Well, I suppose that is why the optimal policy can be found, but not why an optimal policy will always exist.
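One standard way to close that gap, offered here as a sketch rather than as part of the original answer, goes through the optimal value function $V^*(s) = \max_a [r(s,a) + \gamma V^*(\delta(s,a))]$. Because of the Markov property, the best action in a state depends only on that state, so the maximizing action can be picked in every state independently, and the resulting greedy policy achieves $V^*$ from all start states at once. The code below checks this by brute force on a small deterministic MDP; the MDP, its rewards, and the discount factor 0.9 are hypothetical and chosen only for illustration.

```python
from itertools import product

GAMMA = 0.9  # assumed discount factor
STATES = [0, 1, 2]
ACTIONS = ["L", "R"]

# Hypothetical deterministic MDP: (state, action) -> (reward, next_state).
MDP = {
    (0, "L"): (1.0, 0), (0, "R"): (0.0, 1),
    (1, "L"): (0.0, 0), (1, "R"): (2.0, 2),
    (2, "L"): (0.0, 1), (2, "R"): (5.0, 2),
}


def backup(s, a, v):
    """One-step lookahead: immediate reward plus discounted successor value."""
    reward, nxt = MDP[(s, a)]
    return reward + GAMMA * v[nxt]


def evaluate(policy, sweeps=1000):
    """Approximate V^policy by repeated backups under a fixed policy."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: backup(s, policy[s], v) for s in STATES}
    return v


def value_iteration(sweeps=1000):
    """Approximate V* by repeatedly taking the best backup in each state."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: max(backup(s, a, v) for a in ACTIONS) for s in STATES}
    return v


v_star = value_iteration()
# Greedy policy: choose the maximizing action separately in each state.
pi_star = {s: max(ACTIONS, key=lambda a: backup(s, a, v_star)) for s in STATES}
v_pi_star = evaluate(pi_star)

# Enumerate every deterministic policy and compare state by state.
all_policies = [dict(zip(STATES, choice))
                for choice in product(ACTIONS, repeat=len(STATES))]
dominates = all(v_pi_star[s] >= evaluate(pi)[s] - 1e-9
                for pi in all_policies for s in STATES)

print("greedy policy:", pi_star)
print("at least as good as every policy in every state:", dominates)
```

The enumeration is only feasible because the example is tiny. The argument itself does not rely on enumeration, only on the per-state maximization being independent of where the agent started.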

by: Keith A. Pray
Last Modified: July 4, 2004 9:03 AM
