Project 10.1: Reinforcement Learning
Exercise 13.3
Consider playing Tic-Tac-Toe against an opponent who plays randomly.
In particular, assume the opponent chooses with uniform
probability any open space, unless there is a forced move
(in which case it makes the obvious correct move).
Formulate the problem of learning an optimal Tic-Tac-Toe
strategy in this case as a Q-learning task.
What are the states, transitions, and rewards in this
non-deterministic Markov decision process?
 1 | 2 | 3
---+---+---
 4 | 5 | 6
---+---+---
 7 | 8 | 9
So, our states consist of which squares are occupied and by whom.
Let's make me X, I like X, don't you? And assume I go first too...
just love going first, depending on the game. A state, then, is any
assignment of X's and O's to the nine squares, and the table below
describes states in terms of which squares O occupies (a short code
sketch for generating the table follows it).
States      : which squares O occupies
Transitions : the square X occupies next, assuming it is not already
              occupied; where no transition is marked there is no
              threat, and X may take any open square
Rewards     : 1 for making the blocking move; where no reward is
              marked, the reward is 0
State (O occupies) ->  Transition (X blocks)  =  Reward
--------------------------------------------------------
1, 2 -> 3 = 1
1, 3 -> 2 = 1
1, 4 -> 7 = 1
1, 5 -> 9 = 1
1, 6 ->
1, 7 -> 4 = 1
1, 8 ->
1, 9 -> 5 = 1
2, 3 -> 1 = 1
2, 4 ->
2, 5 -> 8 = 1
2, 6 ->
2, 7 ->
2, 8 -> 5 = 1
2, 9 ->
3, 4 ->
3, 5 -> 7 = 1
3, 6 -> 9 = 1
3, 7 -> 5 = 1
3, 8 ->
3, 9 -> 6 = 1
4, 5 -> 6 = 1
4, 6 -> 5 = 1
4, 7 -> 1 = 1
4, 8 ->
4, 9 ->
5, 6 -> 4 = 1
5, 7 -> 3 = 1
5, 8 -> 2 = 1
5, 9 -> 1 = 1
6, 7 ->
6, 8 ->
6, 9 -> 3 = 1
7, 8 -> 9 = 1
7, 9 -> 8 = 1
8, 9 -> 7 = 1
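The entries above can be generated mechanically from the eight winning
lines of the board. Here is a minimal sketch in Python (the helper name
forced_block is chosen just for illustration; it assumes every square not
held by O is open, which is enough to reproduce the table):

    from itertools import combinations

    # The eight winning lines of Tic-Tac-Toe, using the square numbering above.
    LINES = [(1, 2, 3), (4, 5, 6), (7, 8, 9),   # rows
             (1, 4, 7), (2, 5, 8), (3, 6, 9),   # columns
             (1, 5, 9), (3, 5, 7)]              # diagonals

    def forced_block(o_squares, open_squares):
        """Square X must take to block O's threatened line, or None if no threat."""
        for line in LINES:
            empties = [s for s in line if s in open_squares]
            if len(empties) == 1 and all(s in o_squares for s in line if s not in open_squares):
                return empties[0]
        return None

    # Reproduce the table: every pair of squares O might occupy, and the forced block.
    for a, b in combinations(range(1, 10), 2):
        opens = set(range(1, 10)) - {a, b}
        block = forced_block({a, b}, opens)
        print(f"{a}, {b} -> {block} = 1" if block else f"{a}, {b} ->")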
Will your program succeed if the opponent plays optimally
rather than randomly?
If the opponent plays optimally, the program will still succeed
just fine: reward is given for simply blocking the opponent, so the
learner is still pushed toward moves that avoid losing. Since it is
impossible to win Tic-Tac-Toe against an optimal strategy, the
program would never learn how to win. But since not losing is good
enough in Tic-Tac-Toe, does it really matter?
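To tie the formulation together, here is a rough sketch of how a tabular
Q-learner might play X against the random opponent described in the
exercise. The hyperparameters, the epsilon-greedy exploration, and the
helper names are all assumptions made for illustration, not part of the
exercise; the reward follows the scheme above (1 for taking the blocking
square, 0 otherwise).

    import random
    from collections import defaultdict

    LINES = [(1, 2, 3), (4, 5, 6), (7, 8, 9),      # rows
             (1, 4, 7), (2, 5, 8), (3, 6, 9),      # columns
             (1, 5, 9), (3, 5, 7)]                 # diagonals

    def threat(marks, player):
        """Square that would complete a line for `player`, or None if there is none."""
        for line in LINES:
            empties = [s for s in line if s not in marks]
            if len(empties) == 1 and all(marks.get(s) == player for s in line if s in marks):
                return empties[0]
        return None

    def winner(marks, player):
        return any(all(marks.get(s) == player for s in line) for line in LINES)

    def open_squares(marks):
        return [s for s in range(1, 10) if s not in marks]

    def opponent_move(marks):
        """O makes the forced move (win, then block) if there is one, else plays uniformly at random."""
        forced = threat(marks, 'O') or threat(marks, 'X')
        return forced if forced is not None else random.choice(open_squares(marks))

    Q = defaultdict(float)                     # Q[(state, action)] table
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1      # learning rate, discount, exploration

    def state_key(marks):
        return tuple(sorted(marks.items()))    # hashable description of the board

    def choose_action(marks):
        opens = open_squares(marks)
        if random.random() < EPSILON:
            return random.choice(opens)                  # explore
        s = state_key(marks)
        return max(opens, key=lambda a: Q[(s, a)])       # exploit

    def run_episode():
        marks = {}                             # X (the learner) moves first
        while True:
            s = state_key(marks)
            a = choose_action(marks)
            # Reward 1 for taking the blocking square, 0 otherwise, as in the table above.
            r = 1 if a == threat(marks, 'O') else 0
            marks[a] = 'X'
            done = winner(marks, 'X') or not open_squares(marks)
            if not done:
                marks[opponent_move(marks)] = 'O'
                done = winner(marks, 'O') or not open_squares(marks)
            s2 = state_key(marks)
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in open_squares(marks))
            # Standard Q-learning update.
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            if done:
                break

    for _ in range(20000):
        run_episode()

Folding the opponent's reply into the environment step, as above, is what
makes this a non-deterministic Markov decision process: from the learner's
point of view the next state after its move is random.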