Keith A. Pray - Professional and Academic Site
About Me
·
·
·
LinkedIn Profile Facebook Profile GoodReads Profile
Professional
Academic
Teaching
                                          
Printer Friendly Version
Project 10.1: Reinforcement Learning

Intro ] [ 13.2 ] [ 13.3 ] [ 13.4 ]

Up: Reinforcement ]

Exercise 13.3

Consider playing Tic-Tac-Toe against an opponent who plays randomly. In particular, assume the opponent chooses with uniform probability any open space, unless there is a forced move (in which case it makes the obvious correct move).


  1. Formulate the problem of learning an optimal Tic-Tac-Toe strategy in this case as a Q-learning task. What are the states, transitions, and rewards in this non-deterministic Markov decision process?

    
    	_________________________________________________	
    	|		|		|		|	
    	|	        |	        |		|	
    	|	      	|		|		|	
    	|	1	|	2      	|	3	|	
    	|		|		|		|	
    	|       	|		|       	|	
    	-------------------------------------------------	
    	|		|		|		|	
    	|		|		|		|	
    	|	4	|	5	|	6	|	
    	|	       	|	      	|		|	
    	|		|		|		|	
    	|		|		|		|	
    	-------------------------------------------------	
    	|		|		|		|	
    	|		|		|		|	
    	|	7	|	8	|	9	|	
    	|	       	|		|		|	
    	|		|		|		|	
    	|_______________|_______________|_______________|	
    	  
    So, our states consist of which squares are occupied and by who. Let's make me X, I like X, don't you? Ok, so our states are any any list of squares with an X or O associated with each. Assume I go first too... just love going first, depending on the game.
    States		: In terms of O occupies
    
    Transitions	: X occupies next, 
    		  Assuming not occupied already,
    		  Not marked -> any open square
    
    Rewards		: Not marked = 0
    
    State	Transition	Reward
    ------	----------	------
    
    1, 2	-> 3		= 1
    1, 3	-> 2		= 1
    1, 4	-> 7		= 1
    1, 5	-> 9		= 1
    1, 6	-> 
    1, 7	-> 4		= 1
    1, 8	-> 
    1, 9	-> 5		= 1
    
    2, 3	-> 1		= 1
    2, 4	-> 
    2, 5	-> 8		= 1
    2, 6	-> 
    2, 7	-> 
    2, 8	-> 5		= 1
    2, 9	-> 
    
    3, 4	-> 
    3, 5	-> 7		= 1
    3, 6	-> 9		= 1
    3, 7	-> 5		= 1
    3, 8	-> 
    3, 9	-> 6		= 1
    
    4, 5	-> 6		= 1
    4, 6	-> 5		= 1
    4, 7	-> 1		= 1
    4, 8	-> 
    4, 9	-> 
    
    5, 6	-> 4		= 1
    5, 7	-> 3		= 1
    5, 8	-> 2		= 1
    5, 9	-> 1		= 1
    	
    6, 7	-> 
    6, 8	-> 
    6, 9	-> 3		= 1
    
    7, 8	-> 9		= 1
    7, 9	-> 8		= 1
    
    8, 9	-> 7		= 1
    
    	
    Will your program succeed if the opponent plays optimally rather that randomly?

    If the opponent plays optimally, the program will succeed just fine. Reward is given for simply blocking the opponent. Since it is impossible to win Tic-Tac-Toe against an optimal strategy, the program would never learn how to win. But, since not losing is good enough in Tic-Tac-Toe, does it really matter?


by: Keith A. Pray
Last Modified: July 4, 2004 9:03 AM
© 2004 - 1975 Keith A. Pray.
All rights reserved.

Current Theme: 

Kapowee Hosted | Kapow Generated in 0.008 second | XHTML | CSS