Report of Final Project: Solving the Sliding Puzzle Game Using a Reinforcement Learning Technique (Course: ECE 517)
Mesbah Uddin
Abstract: The problem addressed in this project is a puzzle game whose goal is to sort the tiles in
ascending order. We applied a reinforcement learning technique to train an agent to solve the puzzle,
aiming for a general solution that works for puzzles of any size. Our approach divides the
problem into many local sub-problems and is based on unsupervised learning. We trained our agent to find
locally optimal policies for these sub-problems and combined them to generate the overall solution.
Introduction:
There have been numerous applications of reinforcement learning to arcade games.
Since most of these games can be modeled as Markov processes, they have become a
standard platform for testing the effectiveness of reinforcement learning (RL) algorithms.
One of the main problems limiting the applicability of RL in many of these games is the
unmanageably large state space. This problem can be tackled in several ways. One is an
intelligent state-space representation that dramatically reduces the number of states.
Another is to train on a small state space and transfer the result to a larger one, that is,
to obtain a policy that scales. In this project we applied these ideas to the game of
sliding puzzle.
Background:
In a sliding puzzle game, the goal is to move the pieces until the configuration is sorted in
ascending order. There is one blank position in the configuration, which is used to shift or
move the tile pieces. The most popular versions of this game are the 15-puzzle (4x4) and the
8-puzzle (3x3), as shown in the figure.
State-space:
There are (n^2)! possible tile arrangements for an n-by-n sliding puzzle. So for a 3x3
puzzle, there are 9! = 362,880 arrangements. For 4x4 and 5x5 puzzles, this number grows to
roughly 2x10^13 and 1.5x10^25, respectively.
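As a quick sanity check, these counts can be reproduced with a few lines of Python (an illustrative sketch, not part of the original solver):

    import math

    # State counts for n-by-n sliding puzzles, as computed above.
    for n in range(2, 6):
        states = math.factorial(n * n)
        print(f"{n}x{n} puzzle: ({n}^2)! = {states:,} = {states:.3g} states")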
Solvability:
Not all configurations of a sliding puzzle are solvable. This is readily apparent even for a
2x2 puzzle, as shown in the figures:
Figure 2: Unsolvable states of a 2X2 puzzle
Checking solvability:
There is a method to check whether any given game state is solvable. The rules are:
1. If the grid width is odd, then the number of inversions in a solvable situation is even.
2. If the grid width is even, and the blank is on an even row counting from the bottom
(second-last, fourth-last etc), then the number of inversions in a solvable situation is
odd.
3. If the grid width is even, and the blank is on an odd row counting from the bottom
(last, third-last, fifth-last etc) then the number of inversions in a solvable situation is
even.
An inversion is a pair of tiles in which a tile precedes another tile with a lower number on
it. For example, if, in a 4x4 grid, the number 12 is in the top-left position, then it
contributes 11 inversions, since the numbers 1-11 all come after it. The solution state has
zero inversions. To count inversions, consider the tiles written out in a single row, reading
left to right and top to bottom.
The formula is actually intuitive. Moving the blank tile left or right does not change the
inversion count, because it does not change the sequence of the numbers. Moving the blank up
or down shifts one tile by (width - 1) positions in the row-major sequence, so the inversion
count changes by an amount with the parity of (width - 1). For odd widths this change is even,
so the inversion parity is invariant; for even widths each vertical move flips the inversion
parity but also changes the blank's row by one, which is why the rules couple the inversion
count to the blank's row.
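The three rules translate directly into code. Below is a minimal Python sketch (the flat row-major representation and the function names are our own conventions):

    def count_inversions(tiles):
        # tiles: puzzle flattened row-major; 0 denotes the blank, which is ignored
        return sum(1
                   for i in range(len(tiles))
                   for j in range(i + 1, len(tiles))
                   if tiles[i] and tiles[j] and tiles[i] > tiles[j])

    def is_solvable(tiles, width):
        inversions = count_inversions(tiles)
        if width % 2 == 1:                        # rule 1: odd grid width
            return inversions % 2 == 0
        height = len(tiles) // width
        blank_row_from_bottom = height - tiles.index(0) // width
        if blank_row_from_bottom % 2 == 0:        # rule 2: blank on even row from bottom
            return inversions % 2 == 1
        return inversions % 2 == 0                # rule 3: blank on odd row from bottom

For example, the solved 4x4 state (tiles 1-15 followed by the blank) has zero inversions with the blank on the last row, an odd row from the bottom, so the function reports it as solvable.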
An unsolvable state can never reach the solution, no matter how many moves are made. The
configurations therefore form two disjoint sets: solvable and unsolvable. Using random moves,
every solvable state can be reached from any solvable initial state; similarly, every
unsolvable state can be reached from any unsolvable state. So the whole state space is divided
into two equal, non-intersecting sets.
Since randomly assigning numbers to the puzzle can produce unsolvable states and would require
an additional solvability check, we use a different approach to generate training states. We
apply a number of fully exploratory (ε = 1) moves to the goal state and use the result as the
initial state. This ensures that the initial state is always solvable.
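A minimal sketch of this scrambling procedure (the helper name and move encoding are illustrative assumptions):

    import random

    def scramble(goal, width, num_moves):
        # Apply num_moves random legal moves to the goal state; every state
        # reached this way is solvable by construction.
        tiles = list(goal)
        height = len(tiles) // width
        for _ in range(num_moves):
            blank = tiles.index(0)
            row, col = divmod(blank, width)
            moves = []
            if row > 0:          moves.append(blank - width)
            if row < height - 1: moves.append(blank + width)
            if col > 0:          moves.append(blank - 1)
            if col < width - 1:  moves.append(blank + 1)
            swap = random.choice(moves)
            tiles[blank], tiles[swap] = tiles[swap], tiles[blank]
        return tiles

    start = scramble([1, 2, 3, 4, 5, 0], width=3, num_moves=100)  # 3x2 minimum grid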
However, to test the agent on input from an outside source, we still need to check
solvability, which is easily done using the rules discussed above.
Design:
Why direct reinforcement learning is not feasible:
We have already calculated that the possible state space of a sliding puzzle is (n^2)!, which
grows very rapidly with n. Suppose we use a 4-byte floating-point number to store the value
of every state of a 4x4 puzzle. Then we would need about 16! x 4 ≃ 83.7 terabytes of memory,
and a single sweep over all 16! ≃ 2x10^13 states would take hours even on a modern computer.
Because of these huge memory and processing requirements, it is not feasible to apply
reinforcement learning directly to this game. We need a cleverer approach that reduces the
state space until the computation and storage requirements become feasible.
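The arithmetic is easy to verify (an illustrative one-liner, assuming 4 bytes per state):

    import math

    bytes_needed = math.factorial(16) * 4   # one 4-byte float per state
    print(bytes_needed / 1e12, "TB")        # ~83.7 TB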
Our approach:
Our approach is based on how the human mind handles large state spaces. Instead of reasoning
about the entire state space, we divide the problem into many local sub-problems and solve
them one by one. For our sliding puzzle, the first local goal could be simply bringing tile 1
to the top-left position. The next sub-problem could be placing tile 2 in the second position
while keeping tile 1 fixed, and so on. This is not optimal, but it significantly reduces the
state-space requirement and the complexity of the problem.
We therefore divided our problem into local sub-problems and trained our agent on the first
goal only, then the second goal only, and so on.
Design Objective:
The main objective of this project is to solve the sliding puzzle regardless of the dimensions
given to the solving agent. To this end, the problem is divided into sub-problems, and these
sub-problems are composed to solve larger-dimensional puzzles. Each sub-problem trains a
separate agent, and these agents are used in tandem to solve the full problem. The
sub-problems are designed so that a policy learned on the smallest dimension also works on
larger dimensions. In the game of sliding puzzle there are situations that look similar and
require similar actions to resolve; we took advantage of this to define the sub-problems.
Problem Formulation:
The minimum-grid sub-problem is formulated as a Markov process and solved using the SARSA(λ)
algorithm. We chose this algorithm because the problem is somewhat similar to the grid-world
problem, and the built-in eligibility trace makes it easier to implement while speeding up
learning. As the minimum grid is a 3x2 grid, there are 6! states in total and only half of
them are solvable. Hence there are 6!/2 = 360 states, which is easily manageable with a
tabular method.
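For reference, a minimal tabular SARSA(λ) loop of the kind used here might look as follows. The environment interface (env.reset, env.step) and the helper names are hypothetical placeholders, not the report's actual code; the default parameters are the ones listed later for sub-problem 1.

    import random
    import numpy as np

    def epsilon_greedy(Q, s, epsilon):
        if random.random() < epsilon:
            return random.randrange(Q.shape[1])
        return int(np.argmax(Q[s]))

    def sarsa_lambda(env, n_states, n_actions, episodes,
                     alpha=0.4, gamma=0.9, lam=0.6, epsilon=0.8):
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            E = np.zeros_like(Q)              # eligibility traces, reset per episode
            s = env.reset()
            a = epsilon_greedy(Q, s, epsilon)
            done = False
            while not done:
                s2, r, done = env.step(a)
                a2 = epsilon_greedy(Q, s2, epsilon)
                target = r if done else r + gamma * Q[s2, a2]
                delta = target - Q[s, a]
                E[s, a] += 1.0                # accumulating trace
                Q += alpha * delta * E        # update every traced state-action pair
                E *= gamma * lam              # decay all traces
                s, a = s2, a2
        return Q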
The way a minimum grid is extracted from the main grid, solved by the agent, and mapped back
into the main grid is shown in Figure 15.
Figure 15: A minimum grid containing the two items to be sorted is extracted from the main grid, solved by the agent, and mapped back into the main grid.
For the second sub-problem, the state space needs to be designed intelligently so that it
becomes dimension-independent. In Figure 17, the two situations require similar actions to
take tile 1 to its target position. What makes these two puzzles similar is the position of
tile 1 relative to the target position and the position of the blank tile relative to tile 1.
In the smaller puzzle, the difference between tile 1 and the target position is positive in
both the column and row directions ((1,1) in this case); in the larger puzzle the same
difference is also positive ((2,2) in this case). The same holds for the blank-to-tile-1
offset.
The state is composed of three components:
S1 (target tile relative to target position):
    X1 = C_targetTile - C_targetPosition
    Y1 = R_targetTile - R_targetPosition
S2 (target tile relative to blank tile):
    X2 = C_targetTile - C_blankTile
    Y2 = R_targetTile - R_blankTile
S3 (distances from the target tile to the grid boundaries):
    lB = C_targetTile - 1
    rB = 9 - C_targetTile
    tB = R_targetTile - 1
    dB = 9 - R_targetTile
With this state-space representation, the task of taking a tile to its target destination is
also learned with the SARSA(λ) algorithm. In each episode a random target tile is assigned,
and the episode terminates when the target tile reaches the target position, that is, when
s1 = 9.
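A sketch of this feature extraction (the function name and the 1-based (row, col) convention are our assumptions; the bound of 9 follows the formulas above):

    def relative_state(target_tile, target_pos, blank_tile, max_dim=9):
        # Each argument is a (row, col) pair, 1-based, matching the formulas above.
        r_t, c_t = target_tile
        r_p, c_p = target_pos
        r_b, c_b = blank_tile
        s1 = (c_t - c_p, r_t - r_p)      # S1: target tile relative to target position
        s2 = (c_t - c_b, r_t - r_b)      # S2: target tile relative to blank tile
        s3 = (c_t - 1, max_dim - c_t,    # S3: distances to left/right boundaries
              r_t - 1, max_dim - r_t)    #     and to top/bottom boundaries
        return s1, s2, s3

Because every feature is a relative offset, the same table entry covers the analogous situation on any grid size, which is what makes the representation dimension-independent.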
Challenges:
As we are using a tabular method like SARSA(λ) in sub-problem 1, we need to assign each state
a unique number. The states are the different arrangements of the numbers in the minimum grid,
so we must map each unique arrangement to a unique index. There is no obvious way to do this,
so we used the following approach: every possible arrangement of the numbers 1 to 6 is stored
as a row of an array. For a given state, we flatten the grid into a one-dimensional array and
match it against the rows of the table containing every possible arrangement. The number of
the matching row is taken as the unique number for the state. This process is shown in
Figure 18.
Figure 18: A flattened state is matched against the table of all arrangements (rows 1 to N); the matching row number is the state's unique index.
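A minimal sketch of this table lookup (a Lehmer-code ranking would avoid the linear search, but the lookup below mirrors the matching approach described above):

    from itertools import permutations

    # Row i of this table is the arrangement whose unique state number is i.
    # We use 0 for the blank here; the report labels the cells 1 to 6.
    STATE_TABLE = list(permutations(range(6)))

    def state_index(grid):
        # grid: the 3x2 minimum grid as nested lists, e.g. [[3, 5, 1], [4, 2, 0]]
        flat = tuple(v for row in grid for v in row)
        return STATE_TABLE.index(flat)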
We first approached sub-problem 1, solving the puzzle on the minimum grid. As the minimum grid
is 3x2, the total number of solvable states is 6!/2 = 360. We solved the problem using
SARSA(λ) with parameter values α = 0.4, γ = 0.9, λ = 0.6, ε = 0.8. The goal state is reached
when the sorted sequence is achieved. A sample input and output for an agent that puts five
consecutive numbers in sequence is shown below:
Figure 19: (a) random solvable puzzle (b) solved puzzle by the agent
Sub problem 2:
The next sub-problem is to take a tile to its desired minimum grid. This task is somewhat
lengthy and complicated. For a 3x3 puzzle there are 9! = 362,880 different arrangements, and
to learn a good policy it is necessary to visit each of them several times. In our project we
tried to visit the states as many times as possible. Each time the agent's move reduces the
distance between the target tile and the target position, a reward of +3 is given; if the
distance increases, a reward of -3 is given. The episode ends when the target tile reaches the
target position. The SARSA parameter values were chosen to be α = 0.3, γ = 0.9, λ = 0.6,
ε = 0.4. After training, a sample of the agent's performance in taking a tile to its desired
location is shown in Figure 7. The red color indicates the target position and the green color
indicates the target tile; here tile 50 has to move to tile 50's home location. The agent
accomplishes this through the moves it selects from the policy it has learned. Note that
although the agent was trained on a 3x3 grid, it is able to perform on a 9x9 grid because of
the way the state space is defined.
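A sketch of this shaped reward, assuming Manhattan distance as the distance measure (the ±3 values are from the report; the distance metric itself is our assumption):

    def shaped_reward(old_pos, new_pos, target_pos):
        # +3 if the move brings the target tile closer to the target position,
        # -3 if it moves it farther away, 0 otherwise.
        def manhattan(a, b):
            return abs(a[0] - b[0]) + abs(a[1] - b[1])
        d_old = manhattan(old_pos, target_pos)
        d_new = manhattan(new_pos, target_pos)
        if d_new < d_old:
            return 3
        if d_new > d_old:
            return -3
        return 0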
Conclusion:
We learned how to reduce a large state space to minimize the complexity of a real-world
problem, and how to apply reinforcement learning with sub-optimal solutions to reach the goal
faster. We learned that in many cases where finding the optimal policy is very expensive, a
combination of local sub-optimal policies can train the reinforcement learning agent much
faster, within a practical time limit.