Project 10.1: Reinforcement Learning
Exercise 13.3
Consider playing Tic-Tac-Toe against an opponent who plays randomly.
In particular, assume the opponent chooses with uniform
probability any open space, unless there is a forced move
(in which case it makes the obvious correct move).
Formulate the problem of learning an optimal Tic-Tac-Toe
strategy in this case as a Q-learning task.
What are the states, transitions, and rewards in this
non-deterministic Markov decision process?
 1 | 2 | 3
---+---+---
 4 | 5 | 6
---+---+---
 7 | 8 | 9
So, our states consist of which squares are occupied and by whom.
Let's make me X, I like X, don't you? And assume I go first too...
just love going first, depending on the game. A state, then, is any
assignment of X's and O's to the nine squares, and the table below
describes states in terms of which squares O occupies (a short code
sketch for generating the table follows it).
States      : which squares O occupies
Transitions : the square X occupies next, assuming it is not already
              occupied; where no transition is marked there is no
              threat, and X may take any open square
Rewards     : 1 for making the blocking move; where no reward is
              marked, the reward is 0
State (O occupies) ->  Transition (X blocks)  =  Reward
--------------------------------------------------------
1, 2 -> 3 = 1
1, 3 -> 2 = 1
1, 4 -> 7 = 1
1, 5 -> 9 = 1
1, 6 ->
1, 7 -> 4 = 1
1, 8 ->
1, 9 -> 5 = 1
2, 3 -> 1 = 1
2, 4 ->
2, 5 -> 8 = 1
2, 6 ->
2, 7 ->
2, 8 -> 5 = 1
2, 9 ->
3, 4 ->
3, 5 -> 7 = 1
3, 6 -> 9 = 1
3, 7 -> 5 = 1
3, 8 ->
3, 9 -> 6 = 1
4, 5 -> 6 = 1
4, 6 -> 5 = 1
4, 7 -> 1 = 1
4, 8 ->
4, 9 ->
5, 6 -> 4 = 1
5, 7 -> 3 = 1
5, 8 -> 2 = 1
5, 9 -> 1 = 1
6, 7 ->
6, 8 ->
6, 9 -> 3 = 1
7, 8 -> 9 = 1
7, 9 -> 8 = 1
8, 9 -> 7 = 1
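The entries above can be generated mechanically from the eight winning
lines of the board. Here is a minimal sketch in Python (the helper name
forced_block is chosen just for illustration; it assumes every square not
held by O is open, which is enough to reproduce the table):

    from itertools import combinations

    # The eight winning lines of Tic-Tac-Toe, using the square numbering above.
    LINES = [(1, 2, 3), (4, 5, 6), (7, 8, 9),   # rows
             (1, 4, 7), (2, 5, 8), (3, 6, 9),   # columns
             (1, 5, 9), (3, 5, 7)]              # diagonals

    def forced_block(o_squares, open_squares):
        """Square X must take to block O's threatened line, or None if no threat."""
        for line in LINES:
            empties = [s for s in line if s in open_squares]
            if len(empties) == 1 and all(s in o_squares for s in line if s not in open_squares):
                return empties[0]
        return None

    # Reproduce the table: every pair of squares O might occupy, and the forced block.
    for a, b in combinations(range(1, 10), 2):
        opens = set(range(1, 10)) - {a, b}
        block = forced_block({a, b}, opens)
        print(f"{a}, {b} -> {block} = 1" if block else f"{a}, {b} ->")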
Will your program succeed if the opponent plays optimally
rather than randomly?
If the opponent plays optimally, the program will still succeed
just fine: reward is given for simply blocking the opponent, so the
learner is still pushed toward moves that avoid losing. Since it is
impossible to win Tic-Tac-Toe against an optimal strategy, the
program would never learn how to win. But since not losing is good
enough in Tic-Tac-Toe, does it really matter?
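To tie the formulation together, here is a rough sketch of how a tabular
Q-learner might play X against the random opponent described in the
exercise. The hyperparameters, the epsilon-greedy exploration, and the
helper names are all assumptions made for illustration, not part of the
exercise; the reward follows the scheme above (1 for taking the blocking
square, 0 otherwise).

    import random
    from collections import defaultdict

    LINES = [(1, 2, 3), (4, 5, 6), (7, 8, 9),      # rows
             (1, 4, 7), (2, 5, 8), (3, 6, 9),      # columns
             (1, 5, 9), (3, 5, 7)]                 # diagonals

    def threat(marks, player):
        """Square that would complete a line for `player`, or None if there is none."""
        for line in LINES:
            empties = [s for s in line if s not in marks]
            if len(empties) == 1 and all(marks.get(s) == player for s in line if s in marks):
                return empties[0]
        return None

    def winner(marks, player):
        return any(all(marks.get(s) == player for s in line) for line in LINES)

    def open_squares(marks):
        return [s for s in range(1, 10) if s not in marks]

    def opponent_move(marks):
        """O makes the forced move (win, then block) if there is one, else plays uniformly at random."""
        forced = threat(marks, 'O') or threat(marks, 'X')
        return forced if forced is not None else random.choice(open_squares(marks))

    Q = defaultdict(float)                     # Q[(state, action)] table
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1      # learning rate, discount, exploration

    def state_key(marks):
        return tuple(sorted(marks.items()))    # hashable description of the board

    def choose_action(marks):
        opens = open_squares(marks)
        if random.random() < EPSILON:
            return random.choice(opens)                  # explore
        s = state_key(marks)
        return max(opens, key=lambda a: Q[(s, a)])       # exploit

    def run_episode():
        marks = {}                             # X (the learner) moves first
        while True:
            s = state_key(marks)
            a = choose_action(marks)
            # Reward 1 for taking the blocking square, 0 otherwise, as in the table above.
            r = 1 if a == threat(marks, 'O') else 0
            marks[a] = 'X'
            done = winner(marks, 'X') or not open_squares(marks)
            if not done:
                marks[opponent_move(marks)] = 'O'
                done = winner(marks, 'O') or not open_squares(marks)
            s2 = state_key(marks)
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in open_squares(marks))
            # Standard Q-learning update.
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            if done:
                break

    for _ in range(20000):
        run_episode()

Folding the opponent's reply into the environment step, as above, is what
makes this a non-deterministic Markov decision process: from the learner's
point of view the next state after its move is random.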