
Report of Final Project

Solving Sliding Puzzle Game Using Reinforcement Learning Technique

Course: ECE 517

Submitted by: Md. Munir Hasan, Mesbah Uddin

Submitted on: December 7, 2015

Abstract: The problem addressed in this project is a puzzle game where the goal is to sort the tiles into
ascending order. We applied a reinforcement learning technique to train an agent to solve it, and we tried to
create a general solution for puzzles of any size. Our approach was to divide the problem into many local
sub-problems, and it is based on unsupervised learning. We trained our agent to obtain locally optimal policies
for these sub-problems and combined them to generate the overall solution.
Introduction:
There have been numerous applications of reinforcement learning to various arcade games.
Since most of these games can be modeled as Markov processes, they have become a
standard platform for testing the effectiveness of reinforcement learning (RL) algorithms.
One of the main problems that limits the applicability of RL to many of these games is the
unmanageably large state space. This problem can be tackled in several ways. One is an
intelligent state-space representation that dramatically reduces the number of states. Another
is to perform training on a small state space and apply the result to a larger one, that is, to
obtain a policy that can be scaled. In this project we tried to address this problem in the game
of sliding puzzle.

Background:
In a sliding puzzle game, the target is to move the pieces until they form a sorted configuration
in ascending order. One position in the configuration is blank, and it is used to shift or move
the tile pieces. The most popular versions of this game are the 15-puzzle (4x4) and the 8-puzzle
(3x3), shown in the figure.

Figure 1: 3x3 and 4x4 Sliding puzzle

State-space:

There are (n^2)! possible configurations of an n-by-n sliding puzzle. So for a 3x3
puzzle, the number of possible configurations is 9!, or 362,880. For 4x4 and 5x5 puzzles, this
number is almost 2x10^13 and 1.5x10^25, respectively.
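
As a quick check of these numbers, a short Python sketch (illustrative only) can compute (n^2)! for a few board sizes:

```python
from math import factorial

# Total number of tile arrangements (solvable and unsolvable) of an n-by-n puzzle.
for n in (3, 4, 5):
    states = factorial(n * n)
    print(f"{n}x{n} puzzle: ({n}^2)! = {states:.3g} arrangements")
```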

Solvability:

Not all configurations of a sliding puzzle are solvable. This is apparent even for a 2x2 puzzle,
as shown in the figures:
Figure 2: Unsolvable states of a 2X2 puzzle

Figure 3: Solvable states of a 2X2 puzzle

Checking solvability:

There is a simple rule for checking whether any given game state is solvable:
1. If the grid width is odd, then the number of inversions in a solvable situation is even.
2. If the grid width is even, and the blank is on an even row counting from the bottom
(second-last, fourth-last etc), then the number of inversions in a solvable situation is
odd.
3. If the grid width is even, and the blank is on an odd row counting from the bottom
(last, third-last, fifth-last etc) then the number of inversions in a solvable situation is
even.
An inversion occurs when a tile precedes another tile with a lower number on it. For example, if,
in a 4 x 4 grid, the number 12 is in the top-left position of the figure below, then that tile
contributes 11 inversions, since the numbers 1-11 all come after it. The goal state has zero
inversions. Consider the tiles written out in a row, like this:

Figure 4: A sample configuration


Now we count the number of inversions on the grid of this figure:
- the 12 gives us 11 inversions
- the 1 gives us none
- the 10 gives us 8 inversions
- the 2 gives us none
- the 7 gives us 4 inversions
- the 11 gives us 6 inversions
- the 4 gives us 1 inversion
- the 14 gives us 6 inversions
- the 5 gives us 1 inversion
- the 9 gives us 3 inversions
- the 15 gives us 4 inversions
- the 8 gives us 2 inversions
- the 13 gives us 2 inversions
- the 6 gives us 1 inversion
So there are 49 inversions in this example.
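
This count can be automated. The sketch below counts inversions for a tile sequence read off row by row; the example list is our assumed reading of the Figure 4 configuration (with 0 marking the blank), so the exact layout is an assumption:

```python
def count_inversions(tiles):
    """Count pairs (i, j) with i < j and tiles[i] > tiles[j]; the blank (0) is ignored."""
    flat = [t for t in tiles if t != 0]
    return sum(1 for i in range(len(flat))
                 for j in range(i + 1, len(flat))
                 if flat[i] > flat[j])

# Assumed row-by-row reading of Figure 4 (0 marks the blank).
example = [12, 1, 10, 2, 7, 11, 4, 14, 5, 0, 9, 15, 8, 13, 6, 3]
print(count_inversions(example))  # prints 49, matching the count above
```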

The rule is actually intuitive. Moving the blank tile left or right doesn't change the inversion
count, because it doesn't change the sequence of the numbers.

Figure 5: Move left or right doesn’t change the inversion


However, a move up or down does change the inversion count. For a 3x3 puzzle, consider the
move shown in the figure:

Figure 6: Move top or bottom does change the inversion


The inversion of the left configuration = 7 + 0 + 3 + 2 + 2 + 1 + 0 + 0 = 15.
The inversion of the right configuration= 7 + 0 + 3 + 2 + 0 + 1 + 0 + 0 = 13.
So the total number of inversions has changed, and by an even number (2). In general, for an
nxn puzzle with n odd, a vertical move carries a tile past (n-1) other tiles. Since n is odd,
(n-1) is even, so the change in the inversion count is always an even number. If the starting
configuration of an odd-sized nxn puzzle such as the 3x3 has an odd number of inversions, then
moving the tiles as shown in this example can only reach other configurations with an odd
number of inversions. Since the goal state has zero inversions, it cannot be reached from a
starting configuration with an odd inversion count when n is odd.
The same intuition and mathematics can be used to derive the rules for even-sized sliding
puzzles.

How many solvable and unsolvable states:

An unsolvable state cannot be turned into the goal state no matter how many moves are used. The
configurations therefore form two disjoint sets: solvable and unsolvable. Using random moves,
every other solvable state can be reached from a solvable initial state; similarly, every other
unsolvable state can be reached from any unsolvable state. So the whole state space is divided
into two equal, non-intersecting sets.

Generating random initial state:

Since randomly assigning numbers to a puzzle could produce an unsolvable state and would require
an additional solvability check, we use a different approach to train our agent. In our
approach, we apply a number of fully exploratory (ε = 1) moves to the goal state and take the
result as the initial state. This ensures that our initial state is always solvable.
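
A minimal sketch of this scrambling procedure, assuming a row-major list representation with 0 marking the blank:

```python
import random

def scramble(n, num_moves=200):
    """Apply random legal moves to the goal state of an n x n puzzle.

    Every move is reversible, so the result is always solvable.
    """
    grid = list(range(1, n * n)) + [0]   # goal state in row-major order, 0 = blank
    blank = n * n - 1                    # index of the blank tile
    for _ in range(num_moves):
        r, c = divmod(blank, n)
        moves = [(r + dr, c + dc) for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                 if 0 <= r + dr < n and 0 <= c + dc < n]
        nr, nc = random.choice(moves)    # fully exploratory (epsilon = 1) move
        swap = nr * n + nc
        grid[blank], grid[swap] = grid[swap], grid[blank]
        blank = swap
    return grid

print(scramble(3))                       # e.g. [1, 5, 2, 4, 8, 3, 7, 0, 6]
```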

However, to test the agent on input from an outside source, we do need to check solvability,
which is easily done using the rules discussed above.
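
Based on the three rules above, a solvability test could look like the following sketch (the grid is a row-major list with 0 as the blank; this is illustrative code, not the project's exact implementation):

```python
def is_solvable(grid, n):
    """Check solvability of an n x n sliding puzzle using the inversion-count rules."""
    tiles = [t for t in grid if t != 0]
    inversions = sum(1 for i in range(len(tiles))
                       for j in range(i + 1, len(tiles))
                       if tiles[i] > tiles[j])
    if n % 2 == 1:                                      # odd grid width
        return inversions % 2 == 0
    blank_row_from_bottom = n - (grid.index(0) // n)    # 1 = last row, 2 = second-last, ...
    if blank_row_from_bottom % 2 == 0:                  # blank on an even row from the bottom
        return inversions % 2 == 1
    return inversions % 2 == 0                          # blank on an odd row from the bottom

print(is_solvable([1, 2, 3, 4, 5, 6, 8, 7, 0], 3))      # False: one swap away from the goal
```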

Design:
Why direct reinforcement learning is not feasible:
We have already calculated that the state space of a sliding puzzle has (n^2)! states, which
grows very rapidly with n. Suppose we use a 4-byte floating-point number to store the value of
every state of the 4x4 puzzle. That would require about 16! x 4 bytes ≈ 83.7 terabytes of
memory, and just a single sweep over all 16! ≈ 2x10^13 states would take hours even on a modern
computer. Because of these huge memory and processing requirements, it is not possible to use
reinforcement learning directly to solve this game. We need a cleverer approach that reduces the
state space so that the computation and storage requirements become feasible.

Our approach:
Our approach is based on how the human mind handles a large state space. Instead of reasoning
about the entire state space, we divide the problem into many local sub-problems and solve the
local problems one by one. For our sliding puzzle, the first local goal could be simply bringing
tile 1 to the top-left position. The next sub-problem could be to put tile 2 in the second
position while keeping tile 1 fixed, and so on. The result is not optimal, but it significantly
reduces the state-space requirement and the complexity of the problem.

Accordingly, we divided our problem into local sub-problems and trained our agent for the first
goal only, then the second goal only, and so on.

Design Objective:
The main objective of this project is to be able to solve the sliding puzzle irrespective of the
dimensions provided to the solving agent. To do this, the problem is divided into sub-problems,
and these sub-problems are composed to handle puzzles of larger dimensions. A separate agent is
trained for each sub-problem, and the agents are used in tandem to solve the larger-dimensional
problem. The sub-problems are designed so that what works in the smallest dimension also works
in larger dimensions. In the game of sliding puzzle there are situations that are similar and
that require similar actions to solve; we took advantage of this when constructing the
sub-problems.

Figure 7: States that are similar


Figure 7 shows such a situation. On the left puzzle, the actions needed to bring tile 5 to the
upper-right position are the same as the actions needed to bring tile 15 to the upper-right
position on the right puzzle. If we concentrate only on bringing tile 5 or tile 15 to its
position, then manipulating the blank tile around these tiles is sufficient, and the arrangement
of the other tiles is unimportant. In this way the region of interest (ROI) of the puzzle
reduces to a 3x2 grid. We call this 3x2 grid the 'minimum grid', because it is the minimum space
in which all of the actions needed to take a tile to its desired position can be performed. The
structure of the minimum grid is shown in Figure 8.

Figure 8: Minimum Grid


In this minimum grid, solving the sliding puzzle is relatively easy because the state space is
greatly reduced. We define a minimum grid starting from the top-left corner of a tile's true
position. For example, tile 1 should be placed in the top-left corner of the puzzle, so we
define the minimum grid for tile 1 by taking that position as the top-left corner of the minimum
grid, as shown in Figure 9.

Figure 9: Position of minimum grid


Bringing tile 1 into its minimum grid is a separate process, which is our next sub-problem. Let
us first concentrate on the sub-problem within the minimum grid. Once tile 1 is in its minimum
grid, it may turn out that the number following 1 is already in that grid, as shown in
Figure 10.

Figure 10: Minimum grid sub problems


An agent trained to solve this situation is employed here. The minimum grid is taken out of the
main grid, the agent sorts it into order, and the sorted grid is placed back into the main grid.
The same agent can also solve the situation shown in Figure 11.
Figure 11: Same agent can be used as Figure 10
States of this kind occur over and over again throughout the puzzle, they require the same sets
of actions to solve, and those actions take place inside a minimum grid. We can take advantage
of this to solve problems of larger dimensions. All we have to do is take a tile to its minimum
grid, apply an agent to solve the minimum grid, then take the next tile to its minimum grid and
solve it with the agent, and so on. When a tile reaches its minimum grid, it may find itself in
five different situations: it may find five sequential numbers already there, or four, and so
on, and each situation needs a different agent. So we need five agents for the minimum grid.
These situations are depicted in Figure 12.

Figure 12: Different agents for minimum grid


Our next sub-problem is to take a tile to its desired minimum grid. This task is very different
from the first sub-problem, so we need a different agent to perform it. A visual representation
of the second sub-problem is shown in Figure 13. With the combination of sub-problem 1 and
sub-problem 2, the sliding puzzle can be solved effectively, and there is no constraint on the
puzzle dimensions for the agent. This mirrors the fact that a human player who can solve a 3x3
sliding puzzle can also solve a 5x5 or 9x9 sliding puzzle without any further training, because
everything needed to learn to solve the sliding puzzle is already present in the 3x3 puzzle.
Figure 13: Second sub problem

Problem Formulation:

The minimum-grid sub-problem is formulated as a Markov decision process and solved using the
SARSA(λ) algorithm. We chose this algorithm because the problem is somewhat similar to the
grid-world problem, and the built-in eligibility trace makes it easier to implement while making
learning faster. Since the minimum grid is a 3x2 grid, there are 6! states in total and only
half of them are solvable. Hence there are 6!/2 = 360 states, which is fairly manageable for a
tabular method.
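
A minimal tabular SARSA(λ) loop for this formulation might look like the sketch below. The environment interface (env.reset, env.step), the state indexing, and the epsilon-greedy helper are placeholders rather than the project's exact code; the default parameters match the values reported in the Experiment section.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, n_actions):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_lambda(env, n_states, n_actions, episodes=5000,
                 alpha=0.4, gamma=0.9, lam=0.6, epsilon=0.8):
    """Tabular SARSA(lambda) with replacing eligibility traces (illustrative sketch)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        E = np.zeros_like(Q)                      # eligibility traces
        s = env.reset()                           # random solvable minimum-grid state (index)
        a = epsilon_greedy(Q, s, epsilon, n_actions)
        done = False
        while not done:
            s2, r, done = env.step(a)             # apply a blank-tile move
            a2 = epsilon_greedy(Q, s2, epsilon, n_actions)
            delta = r + gamma * Q[s2, a2] * (not done) - Q[s, a]
            E[s, a] = 1.0                         # replacing trace
            Q += alpha * delta * E
            E *= gamma * lam
            s, a = s2, a2
    return Q
```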

Figure 14: Sub Problem 1 formulation

The way the minimum grid is mapped out of the main grid, solved by an agent, and then mapped
back into the main grid is shown in Figure 15.
Figure 15: Main grid to minimum grid mapping
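
In code, this mapping amounts to cutting a small window out of the main puzzle, letting the minimum-grid agent sort it, and writing the sorted window back. A hedged sketch, assuming the window is 2 rows by 3 columns with the tile's true position at its top-left corner and the puzzle stored as a list of row lists:

```python
def extract_min_grid(grid, top, left, rows=2, cols=3):
    """Copy a (rows x cols) minimum-grid window of the main puzzle, anchored at (top, left)."""
    return [row[left:left + cols] for row in grid[top:top + rows]]

def write_back_min_grid(grid, window, top, left):
    """Write a solved minimum-grid window back into the main puzzle (in place)."""
    for r, row in enumerate(window):
        grid[top + r][left:left + len(row)] = row
```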


Which agent to use for a particular minimum grid is chosen according to how many sequential
numbers are already present in the minimum grid, as shown in Figure 16.

(In the example of Figure 16, two items in the minimum grid remain to be sorted, so agent 2's
policy (π2) is used.)
Figure 16: Choice of agents for minimum grid

For the second sub-problem, the state space needs to be designed intelligently so that it
becomes dimension independent. In Figure 17, the two situations require similar actions to take
tile 1 to its target position. What makes the two puzzles similar is the position of tile 1 with
respect to the target position and the position of the blank tile with respect to tile 1. For
the smaller puzzle, the difference between tile 1 and the target position is positive in the
column and row directions ((1,1) in this case); for the larger puzzle it is also positive ((2,2)
in this case). The same is true for the position of the blank relative to tile 1.

Figure 17: Sub problem 2 state observation


The states for the second sub-problem are defined as follows. We define the target-tile-to-
target-position relative location in terms of a positive, negative, or zero distance in the
column and row directions, s1(r,c). We also define the blank-tile-to-target-tile relative
location in terms of a positive or negative distance in the column and row directions, s2(r,c).
Finally, we define whether the target tile is on the border of the puzzle, using the relative
position of the border and the target tile in terms of a positive or zero distance from the four
boundaries. The overall state space is three dimensional and is summarized in Table 1.
Table 1

S1 S2 S3
X1 = CtargetTile – CtargetPosition    X2 = CtargetTile – CblankTile    lB = CtargetTile – 1
Y1 = RtargetTile – RtargetPosition    Y2 = RtargetTile – RblankTile    rB = 9 – CtargetTile
                                                                       tB = RtargetTile – 1
                                                                       dB = 9 – RtargetTile

With this state-space representation, the task of taking a tile to its target destination is
also learned with the SARSA(λ) algorithm. In each episode a random target tile is assigned, and
the episode terminates when the target tile reaches the target position, that is, when s1 = 9.
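
A sketch of how the three state components of Table 1 might be computed, keeping only the sign of each difference so that the representation stays dimension independent (the exact encoding used in the project may differ from this illustration):

```python
def sign(x):
    """Return +1, 0, or -1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def sub_problem2_state(target_tile, target_pos, blank, n):
    """Dimension-independent state for sub-problem 2.

    target_tile, target_pos and blank are (row, col) pairs; n is the board width.
    """
    # S1: target tile relative to its target position (sign of row/column difference)
    s1 = (sign(target_tile[0] - target_pos[0]), sign(target_tile[1] - target_pos[1]))
    # S2: target tile relative to the blank tile
    s2 = (sign(target_tile[0] - blank[0]), sign(target_tile[1] - blank[1]))
    # S3: whether the target tile sits on each of the four borders (left, right, top, bottom)
    s3 = (target_tile[1] == 0, target_tile[1] == n - 1,
          target_tile[0] == 0, target_tile[0] == n - 1)
    return s1, s2, s3
```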

Challenges:
Because we use a tabular method such as SARSA(λ) in sub-problem 1, we need to index each state
with a unique number. The states are the different arrangements of the numbers in the minimum
grid, so we need to assign a unique number to each arrangement, and there is no obvious way to
do this. We therefore used the following approach. Every possible permutation of the numbers 1
to 6 is stored as a row of an array. For a given state, we flatten the minimum grid into a
one-dimensional array and match it against the array containing every permutation of the numbers
1 to 6. When a row matches the state, we take that row number as the unique index of the state.
This process is shown in Figure 18.
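
One compact way to implement this lookup is to enumerate all permutations of 1 to 6 once and keep a dictionary from permutation to row index. The sketch below illustrates the idea rather than reproducing the exact code we used; a dictionary gives constant-time lookup, while matching row by row as described above is linear in 6!, but both give the same numbering.

```python
from itertools import permutations

# Row i of the (virtual) table is the i-th permutation of the numbers 1..6.
STATE_INDEX = {p: i for i, p in enumerate(permutations(range(1, 7)))}

def state_index(min_grid):
    """Map a flattened 3x2 minimum grid (some ordering of the numbers 1..6) to a unique index."""
    return STATE_INDEX[tuple(min_grid)]

print(state_index([1, 2, 3, 4, 5, 6]))   # 0 (the first row of the table)
print(state_index([3, 5, 1, 4, 2, 6]))   # some other index in 0 .. 6!-1
```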

Figure 18: State numbering for tabular structure


Experiment:
• Sub-problem 1:

At first we approached the first sub-problem, which is to solve the puzzle inside the minimum
grid. Since the minimum grid is 3x2, the total number of solvable states is 6!/2 = 360. We
solved the problem using SARSA(λ) with the parameter values α = 0.4, γ = 0.9, λ = 0.6, ε = 0.8.
The goal state is reached when the sorted sequence is achieved. A sample input and output for an
agent that sorts 5 consecutive numbers is shown below:

Figure 19: (a) random solvable puzzle (b) solved puzzle by the agent

• Sub-problem 2:

The next sub-problem is to take a tile to its desired minimum grid. This task is somewhat
lengthy and complicated. For a 3x3 puzzle there are 9! = 362,880 different configurations, and
to obtain a good policy it is necessary to visit each possible configuration several times, so
in our project we tried to visit the states as many times as possible. Each time the agent's
move reduces the distance between the target position and the target tile, a reward of +3 is
given; if the distance increases, a reward of −3 is given. The episode ends when the target tile
reaches the target position. The SARSA parameter values are chosen to be α = 0.3, γ = 0.9,
λ = 0.6, ε = 0.4. After the agent is trained, a sample of its performance in taking a tile to
its desired location is shown in Figure 20. The red color indicates the target position and the
green color indicates the target tile. Here tile 50 has to sit in tile 50's location, and the
agent is able to do that with the moves it selects from the policy it has learned. It should be
noted that the agent is trained on a 3x3 grid, yet it is able to perform on a 9x9 grid because
of the way the state space is defined.
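
The distance-based reward described above could be written as the following sketch, where the distance is taken to be the Manhattan distance between the target tile and the target position (the exact distance measure is our assumption):

```python
def shaping_reward(tile_before, tile_after, target):
    """+3 if the move brings the target tile closer to the target position, -3 if farther, 0 otherwise."""
    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    before = manhattan(tile_before, target)
    after = manhattan(tile_after, target)
    if after < before:
        return 3
    if after > before:
        return -3
    return 0
```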

Figure 20: Sub problem 2 agent performance

Conclusion and Discussion:


We started from two different points to solve the problem. One was to find an optimal solution
for the 2x3 grid and apply it over and over again on the whole grid, using overlapping grids for
this purpose. The other was to train our agent on a policy that moves a tile to its goal
location optimally. Due to the shortage of time, we could not present a complete end-to-end
result.

We learned how to reduce a large state space to minimize the complexity of a real-world problem,
and how to implement reinforcement learning with sub-optimal solutions to reach the goal faster.
We learned that in many cases where it is very expensive to find an optimal policy, a
combination of local sub-optimal policies can train the reinforcement learning agent much faster
and within a practical time limit.
