Project 10.1: Reinforcement Learning


Exercise 13.2

Consider the deterministic grid world shown below with the absorbing goal state G. The immediate rewards are 10 for the labeled transitions and 0 for all unlabeled transitions.


	_________________________________________________	
	|		|		|		|	
	|		|		|		|	
	|	      --->	      --->		|	
	|	      <---	      <---		|	
	|		|		|		|	
	|      ^ |	|	|10	|      ^ |	|	
	-------|-|--------------|--------------|-|-------	
	|      | v	|	v	|      | v	|	
	|		|		|		|	
	|	      10|	G	|10		|	
	|	       --->	      <---		|	
	|		|		|		|	
	|_______________|_______________|_______________|	
	  

    1. Give the V* value for every state in this grid world. (A short value-iteration sketch follows the answer grid.)

      
      _________________________________________________
      |		|		|		|
      |		|		|		|
      |	8      --->	10     --->	8	|
      |	      <---	      <---		|
      |		|		|		|
      |      ^ |	|	|	|      ^ |	|
      -------|-|--------------|--------------|-|-------
      |      | v	|	v	|      | v	|
      |		|		|		|
      |	10      |	G	|	10	|
      |	       --->	      <---		|
      |		|	0	|		|
      |_______________|_______________|_______________|
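
      As a sanity check, here is a minimal value-iteration sketch in Python that reproduces these values. The state labels (TL, TM, TR, BL, G, BR) and the transition encoding are my own shorthand for the grid cells; the rewards and gamma = 0.8 come from the problem statement.

      # Deterministic grid world from the problem, gamma = 0.8.
      # (state, action) -> (next_state, reward); G is absorbing, so it has no actions.
      GAMMA = 0.8
      TRANSITIONS = {
          ("TL", "right"): ("TM", 0), ("TM", "left"): ("TL", 0),
          ("TM", "right"): ("TR", 0), ("TR", "left"): ("TM", 0),
          ("TL", "down"):  ("BL", 0), ("BL", "up"):   ("TL", 0),
          ("TR", "down"):  ("BR", 0), ("BR", "up"):   ("TR", 0),
          ("TM", "down"):  ("G", 10),                        # labeled transitions into G
          ("BL", "right"): ("G", 10), ("BR", "left"): ("G", 10),
      }
      STATES = ["TL", "TM", "TR", "BL", "G", "BR"]

      def value_iteration(transitions, states, gamma, sweeps=50):
          v = {s: 0.0 for s in states}
          for _ in range(sweeps):
              for s in states:
                  choices = [r + gamma * v[nxt]
                             for (st, _), (nxt, r) in transitions.items() if st == s]
                  if choices:                                # G has no outgoing transitions
                      v[s] = max(choices)
          return v

      print(value_iteration(TRANSITIONS, STATES, GAMMA))
      # expected: TL = 8, TM = 10, TR = 8, BL = 10, G = 0, BR = 10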
      		  
    2. Give the Q(s, a) value for every transition. (A sketch computing Q from V* follows the answer grid.)

      
      _________________________________________________
      |		|		|		|
      |	       8|	     6.4|		|
      |	       --->	       --->		|
      |	      <---	      <---		|
      |		|6.4		|8		|
      |      ^ |8	|	|10	|      ^ |8	|
      -------|-|--------------|--------------|-|-------
      |   6.4| v	|	v	|   6.4| v	|
      |		|		|		|
      |	        |	G	|		|
      |	       --->	      <---		|
      |	      10|	0	|10		|
      |_______________|_______________|_______________|
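
      Each of these values is just Q(s, a) = r(s, a) + gamma * V*(s'), where s' is the deterministic successor of taking action a in state s. A short sketch, reusing the same hypothetical state labels as above and the V* values from part 1, spells this out:

      # Q(s, a) = r(s, a) + gamma * V*(s') for every transition in the grid world.
      # State labels are my own; the V* values are the part 1 answers.
      GAMMA = 0.8
      V_STAR = {"TL": 8, "TM": 10, "TR": 8, "BL": 10, "G": 0, "BR": 10}
      TRANSITIONS = {   # (state, action) -> (next_state, reward)
          ("TL", "right"): ("TM", 0), ("TM", "left"): ("TL", 0),
          ("TM", "right"): ("TR", 0), ("TR", "left"): ("TM", 0),
          ("TL", "down"):  ("BL", 0), ("BL", "up"):   ("TL", 0),
          ("TR", "down"):  ("BR", 0), ("BR", "up"):   ("TR", 0),
          ("TM", "down"):  ("G", 10),
          ("BL", "right"): ("G", 10), ("BR", "left"): ("G", 10),
      }

      for (s, a), (nxt, r) in TRANSITIONS.items():
          print(f"Q({s}, {a}) = {r + GAMMA * V_STAR[nxt]}")
      # e.g. Q(TM, left) = 0 + 0.8 * 8 = 6.4 and Q(BL, right) = 10 + 0.8 * 0 = 10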
      		  
    3. Finally, show an optimal policy, using gamma = 0.8 (assume this value of gamma for all parts). A sketch extracting the greedy policy from the Q values follows the answer grid.

      
      _________________________________________________
      |		|		|		|
      |		|		|		|
      |	      	|		|		|
      |	        |	      	|		|
      |		|		|		|
      |       |	|	|	|       |	|
      --------|---------------|---------------|--------
      |       v	|	v	|       v	|
      |		|		|		|
      |		|	G	|		|
      |	       --->	      <---		|
      |		|		|		|
      |_______________|_______________|_______________|
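
      The optimal policy is simply the greedy policy with respect to the part 2 Q values, pi*(s) = argmax_a Q(s, a). Note that the two top corner squares each have two equally good actions (both worth 8), so the all-downward policy drawn above is one of several optimal policies. A small sketch, again using my own state and action labels, makes the argmax and the ties explicit:

      # pi*(s) = argmax_a Q(s, a), using the Q(s, a) values from part 2.
      # State and action names are my own shorthand for the grid cells.
      Q = {
          ("TL", "right"): 8.0, ("TL", "down"): 8.0,
          ("TM", "left"): 6.4,  ("TM", "right"): 6.4, ("TM", "down"): 10.0,
          ("TR", "left"): 8.0,  ("TR", "down"): 8.0,
          ("BL", "up"): 6.4,    ("BL", "right"): 10.0,
          ("BR", "up"): 6.4,    ("BR", "left"): 10.0,
      }

      for s in ["TL", "TM", "TR", "BL", "BR"]:        # G is absorbing, no action needed
          actions = {a: q for (st, a), q in Q.items() if st == s}
          best = max(actions.values())
          greedy = [a for a, q in actions.items() if q == best]
          print(f"pi*({s}): any of {greedy}")
      # TL and TR each have two greedy actions; TM, BL, BR each have a unique one.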
      		  

    1. Suggest a change to the reward function r(s, a) that alters the Q(s, a) values but does not alter the optimal policy.

      Assuming that trivial changes, such as scaling all the reward values from 10 to some other constant or changing gamma (or both), are not what this is asking, one option is to add a reward of 1 to every transition that does not enter G while leaving the transitions into G at 10. (A numerical check of this change, and of the one in part 2 below, follows part 2's grid.)

      
      _________________________________________________
      |		|		|		|
      |	       1|	       1|		|
      |	      --->	      --->		|
      |	      <---	      <---		|
      |		|1		|1		|
      |      ^ |1	|	|10	|      ^ |1	|
      -------|-|--------------|--------------|-|-------
      |     1| v	|	v	|     1| v	|
      |		|		|		|
      |	      10|	G	|10		|
      |	       --->	      <---		|
      |		|		|		|
      |_______________|_______________|_______________|
      	  

    2. Suggest a change to r(s, a) that alters Q(s, a) but does not alter V*(s).

      Adding a reward of 1 only to actions that are never optimal (the two lateral moves out of the top-middle square and the two upward moves) raises those Q values from 6.4 to 7.4, which is still below the 10 available from the best action in each of those states, so every V*(s) is unchanged:

      _________________________________________________
      |		|		|		|
      |	        |	       1|		|
      |	      --->	      --->		|
      |	      <---	      <---		|
      |		|1		|		|
      |      ^ |	|	|10	|      ^ |	|
      -------|-|--------------|--------------|-|-------
      |     1| v	|	v	|     1| v	|
      |		|		|		|
      |	      10|	G	|10		|
      |	       --->	      <---		|
      |		|		|		|
      |_______________|_______________|_______________|
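
      Both suggestions can be checked numerically: re-solve the modified grid worlds and compare against the original. The first change should alter Q but leave the greedy policy unchanged; the second should alter Q but leave V* unchanged. The sketch below is my own encoding; the +1 rewards correspond to the two answer grids above.

      # Re-solve Q for each modified reward function with Q-value iteration,
      # then compare the greedy policy and V* against the original grid world.
      GAMMA = 0.8

      def transitions(plus_one=()):
          """(state, action) -> (next_state, reward); entries in plus_one get reward 1."""
          base = {
              ("TL", "right"): ("TM", 0), ("TM", "left"): ("TL", 0),
              ("TM", "right"): ("TR", 0), ("TR", "left"): ("TM", 0),
              ("TL", "down"):  ("BL", 0), ("BL", "up"):   ("TL", 0),
              ("TR", "down"):  ("BR", 0), ("BR", "up"):   ("TR", 0),
              ("TM", "down"):  ("G", 10),
              ("BL", "right"): ("G", 10), ("BR", "left"): ("G", 10),
          }
          return {sa: (nxt, 1 if sa in plus_one else r) for sa, (nxt, r) in base.items()}

      def solve_q(trans, sweeps=50):
          q = {sa: 0.0 for sa in trans}
          for _ in range(sweeps):
              for sa, (nxt, r) in trans.items():
                  q[sa] = r + GAMMA * max((q[k] for k in trans if k[0] == nxt), default=0.0)
          return q

      def greedy(q):      # state -> set of greedy actions (ties kept)
          by_state = {}
          for (s, a), val in q.items():
              by_state.setdefault(s, {})[a] = val
          return {s: {a for a, v in acts.items() if v == max(acts.values())}
                  for s, acts in by_state.items()}

      def v_star(q):      # state -> max_a Q(s, a)
          return {s: max(v for (st, _), v in q.items() if st == s)
                  for s in {sa[0] for sa in q}}

      q0 = solve_q(transitions())
      # Part 1's change: reward 1 on every transition that does not enter G.
      q1 = solve_q(transitions(plus_one=tuple(
          sa for sa, (nxt, _) in transitions().items() if nxt != "G")))
      # Part 2's change: reward 1 only on the four never-optimal actions.
      q2 = solve_q(transitions(plus_one=(("TM", "left"), ("TM", "right"),
                                         ("BL", "up"), ("BR", "up"))))

      print(q1 != q0, greedy(q1) == greedy(q0))   # True True: Q changed, policy kept
      print(q2 != q0, v_star(q2) == v_star(q0))   # True True: Q changed, V* kept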
      	  

  1. Now consider applying the Q learning algorithm to this grid world, assuming the table of Q_hat values is initialized to zero. Assume the agent begins in the bottom left grid square and then travels clockwise around the perimeter of the grid until it reaches the absorbing goal state, completing the first training episode.

    1. Describe which Q_hat values are modified as a result of this episode, and give their revised values.

      Assume we start with the original grid world and reward function from the beginning of this problem; all Q_hat values not shown remain zero. Every immediate reward along the clockwise path is zero except the final transition into G, and every Q_hat entry starts at zero, so the only value that changes is Q_hat for the bottom-right square's move into G, which becomes 10 + 0.8 * 0 = 10.

      
      _________________________________________________
      |		|		|		|
      |	        |	       |		|
      |	      --->	      --->		|
      |	      <---	      <---		|
      |		|		|		|
      |      ^ |	|	|	|      ^ |	|
      -------|-|--------------|--------------|-|-------
      |      | v	|	v	|      | v	|
      |		|		|		|
      |	        |	G	|10		|
      |	       --->	      <---		|
      |		|		|		|
      |_______________|_______________|_______________|
      	  

    2. Answer the question again assuming the agent now performs a second identical episode.

      Q_hat for the bottom-right square's move into G is now 10, so when the agent moves down from the top-right square that move is updated to 0 + 0.8 * 10 = 8; the final move into G is set to 10 again.
      
      _________________________________________________
      |		|		|		|
      |	        |	       |		|
      |	      --->	      --->		|
      |	      <---	      <---		|
      |		|		|		|
      |      ^ |	|	|	|      ^ |8	|
      -------|-|--------------|--------------|-|-------
      |      | v	|	v	|      | v	|
      |		|		|		|
      |	        |	G	|10		|
      |	       --->	      <---		|
      |		|		|		|
      |_______________|_______________|_______________|
      	  

    3. Answer it again for a third episode.

      Now Q_hat for the downward move from the top-right square is 8, so the rightward move out of the top-middle square is updated to 0 + 0.8 * 8 = 6.4; the two values learned earlier are rewritten with the same numbers.
      
      _________________________________________________
      |		|		|		|
      |	        |	     6.4|		|
      |	       --->	       --->		|
      |	      <---	      <---		|
      |		|		|		|
      |      ^ |	|	|	|      ^ |8	|
      -------|-|--------------|--------------|-|-------
      |      | v	|	v	|      | v	|
      |		|		|		|
      |	        |	G	|10		|
      |	       --->	      <---		|
      |		|		|		|
      |_______________|_______________|_______________|
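
      The same numbers fall out of replaying the clockwise trajectory (bottom-left, up, right, right, down, then left into G) through the deterministic Q learning update Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a'). The trajectory encoding and state labels below are my own:

      # Tabular Q learning in a deterministic world:
      #   Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
      # replayed along the clockwise perimeter trajectory for three episodes.
      GAMMA = 0.8
      TRANSITIONS = {   # (state, action) -> (next_state, reward)
          ("TL", "right"): ("TM", 0), ("TM", "left"): ("TL", 0),
          ("TM", "right"): ("TR", 0), ("TR", "left"): ("TM", 0),
          ("TL", "down"):  ("BL", 0), ("BL", "up"):   ("TL", 0),
          ("TR", "down"):  ("BR", 0), ("BR", "up"):   ("TR", 0),
          ("TM", "down"):  ("G", 10),
          ("BL", "right"): ("G", 10), ("BR", "left"): ("G", 10),
      }
      # bottom-left, up, right, right, down, then left into the goal
      EPISODE = [("BL", "up"), ("TL", "right"), ("TM", "right"),
                 ("TR", "down"), ("BR", "left")]

      q_hat = {sa: 0.0 for sa in TRANSITIONS}             # table initialized to zero

      for episode in range(1, 4):
          for s, a in EPISODE:
              nxt, r = TRANSITIONS[(s, a)]
              # G has no outgoing actions, so the max over its Q_hat values defaults to 0
              best_next = max((q_hat[k] for k in TRANSITIONS if k[0] == nxt), default=0.0)
              q_hat[(s, a)] = r + GAMMA * best_next
          print(f"after episode {episode}:",
                {sa: v for sa, v in q_hat.items() if v != 0.0})
      # episode 1: only Q_hat(BR, left) = 10
      # episode 2: also Q_hat(TR, down) = 8
      # episode 3: also Q_hat(TM, right) = 6.4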
      	  

 
