Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 123 (2018) 347–353
8th Annual International Conference on Biologically Inspired Cognitive Architectures, BICA 2017
Grid Path Planning with Deep Reinforcement Learning: Preliminary Results

Aleksandr I. Panov¹,², Konstantin S. Yakovlev¹,², and Roman Suvorov¹

¹ Federal Research Center “Computer Science and Control” of Russian Academy of Sciences, Moscow, Russia
² National Research University Higher School of Economics, Moscow, Russia
{pan,yakovlev,rsuvorov}@isa.ru

Abstract

Single-shot grid-based path finding is an important problem with applications in robotics, video games, etc. Typically, in the AI community heuristic search methods (based on A* and its variations) are used to solve it. In this work we present the results of preliminary studies on how neural networks can be utilized for path planning on square grids, e.g. how well they can cope with path finding tasks by themselves within the well-known reinforcement learning problem statement. The conducted experiments show that an agent using the neural Q-learning algorithm robustly learns to achieve the goal on small maps and demonstrates promising results on maps it has never seen before.

© 2018 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).
Peer-review under responsibility of the scientific committee of the 8th Annual International Conference on Biologically Inspired Cognitive Architectures.

Keywords: path planning, reinforcement learning, neural networks, Q-learning, convolution networks, Q-network
1 Introduction

Finding a single path on a given static grid is a well-known and well-studied problem in the AI, planning and robotics communities [16], with a large variety of methods and algorithms proposed so far. Most of these algorithms rely on heuristic search in the state-space induced by the grid cells and are based on the well-known A* algorithm proposed in 1968 [4]. Among the most wide-spread algorithms that solve the task on-line, e.g. without any preprocessing, one can name IDA* [7], ARA* [8], JPS [3], Theta* [10], etc., all of which are essentially heuristic search algorithms.

At the same time, we have recently witnessed a huge break-through in applying neural networks to all sorts of tasks within the AI domain, including playing the game of Go [11], recognizing objects from images [13], generating images and representation learning [2], machine translation [17], speech recognition [5], etc. One common thing that can be noticed is that neural networks cope well with all sorts of tasks when sensory or image data (e.g. “raw pixels”) is used as an input. Clearly, square grids containing only two types of cells, i.e. traversable and
untraversable, look like a perfect input to modern artificial neural network (NN) architectures,
such as convolutional NN and residual learning NN. The path finding problem in that case can be
viewed as a problem of learning which image pixels (grid cells) to identify as being part of the
sought path.
In this paper preliminary results of applying reinforcement learning machinery to 2D grid-
based path finding are presented. First, provided with the most commonly-used problem state-
ment we define the appropriate machine learning problem (Section 2). Second, we suggest a
variety of tools, e.g. various deep neural network architectures, to solve it (Section 3). And, fi-
nally, we evaluate their performance running a series of experiments in simulated environments
(Section 4). The results of the experiments are twofold. On the one hand, it is clear that the proposed NN learns to plan paths, e.g. after the learning phase it is capable of solving previously unseen instances. On the other hand, the NN sometimes fails to find a path, and when a path is found it is likely to be longer than the path found by the A* algorithm, i.e. the shortest one. The obtained results allow us to claim that although the NN cannot yet be seen as an alternative to well-established path finding techniques, the application of various deep neural network architectures to path finding tasks is a promising line of research.

2 Problem statement
Consider an agent operating on the 8-connected grid composed of blocked and unblocked cells.
Blocked cells are encoded as ones (white pixels of the image) and unblocked cells as zeroes (black
pixels of the image). Given the start and goal locations, which are tied to the centers of distinct
unblocked cells s and g, the task is to find a path π which is a sequence of adjacent traversable
cells starting with s and ending with g, π = π(s, g) = (s, succ(s), succ(succ(s)), . . . , g). Here
succ(x) stands for the successor of cell x, and we enforce successors to be neighboring (adjacent) cells. The cost of a move between x and succ(x) equals 1 if it is a cardinal move and √2 if it is a diagonal move. The cost of the path is the sum of the costs of all moves comprising π. In this work we do not limit ourselves to finding least-cost paths only, but paths with smaller costs are preferable.
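For illustration, the path cost under this formalization can be computed directly from the sequence of cells; a minimal sketch (the function names move_cost and path_cost are ours, not from the paper):

```python
import math

def move_cost(p, q):
    """Cost of a single move between 8-connected neighbors p and q:
    1 for a cardinal move, sqrt(2) for a diagonal one."""
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    assert max(dx, dy) == 1, "cells must be adjacent on the 8-connected grid"
    return math.sqrt(2) if dx == 1 and dy == 1 else 1.0

def path_cost(path):
    """Total cost of a path given as a sequence of adjacent cells (s, ..., g)."""
    return sum(move_cost(p, q) for p, q in zip(path, path[1:]))

# A cardinal move followed by a diagonal one costs 1 + sqrt(2).
print(path_cost([(0, 0), (0, 1), (1, 2)]))
```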
To transform the abovementioned path finding problem formalization, widely used in the AI community, into a reinforcement learning problem, we utilize the framework of agent-centered search, which interleaves planning and execution. Planning here means deciding which movement action (going left, right, up, down, etc.) at = pt → pt+1 to perform next, and execution stands for performing this action as well as receiving feedback from the environment, e.g. a reward that is a real number. We denote the position of the agent at time t as pt = (xt, yt), where x and y are grid cell coordinates. We consider only finite sequences of actions, until the agent reaches the destination g or the maximal number of steps T is reached.
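For concreteness, the eight movement actions can be encoded as cell displacements; a minimal sketch (this particular ordering of the actions is our assumption, not fixed by the paper):

```python
# Eight movement actions on the 8-connected grid: action index -> (dx, dy).
ACTIONS = [(-1, -1), (-1, 0), (-1, 1),
           (0, -1),           (0, 1),
           (1, -1),  (1, 0),  (1, 1)]

def apply_action(p, a):
    """Return p_{t+1}: the position p_t shifted by the displacement of action a."""
    dx, dy = ACTIONS[a]
    return (p[0] + dx, p[1] + dy)
```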
The agent has a limited field of view, denoted st ∈ S, S = R^((2d+1)²). In this work we presume that it is a square window of the predefined size (2d + 1) × (2d + 1) (say 11 × 11, d = 5) with the agent placed in its center (so the agent is capable of seeing in all four directions). When the agent approaches a grid edge, the portion of st which is out of bounds is filled with ones (so the outer map is considered to be an obstacle).
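A minimal sketch of extracting such an observation window, assuming the map M is a NumPy array with 1 = blocked and 0 = free (the function name is ours):

```python
import numpy as np

def observation(M, pos, d):
    """Return the (2d+1) x (2d+1) window of map M centered at pos = (x, y).
    Out-of-bounds cells are filled with ones, i.e. treated as obstacles."""
    size = 2 * d + 1
    window = np.ones((size, size), dtype=M.dtype)  # pad with "obstacle" values
    x, y = pos
    x0, x1 = max(0, x - d), min(M.shape[0], x + d + 1)  # overlap with the map
    y0, y1 = max(0, y - d), min(M.shape[1], y + d + 1)
    window[x0 - (x - d):x1 - (x - d), y0 - (y - d):y1 - (y - d)] = M[x0:x1, y0:y1]
    return window
```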
The reward at time t, i.e. rt, is calculated using the function G(s, g, t), which is unknown to the agent; here s and g are the start and goal locations. This function, as well as the grid map M, comprise the model of the environment E = (M, G). The task of the agent operating in this environment is twofold. First, the agent has to learn: during the learning phase the agent aims at maximizing its overall reward while acting in the environment, e.g. relocating on a grid map. Second, the agent has to navigate to the goal location on any previously unseen grid map it is put into, without having access to the reward function G.

2.1 Agent learning process


The goal of the agent is to move around the map while receiving the maximal sum of rewards from the environment, generated according to the algorithm G. We consider all future rewards from the moment t with discount coefficient γ, so the summarized reward is calculated as

R_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}.

As is customary in reinforcement learning [6, 12], we define the optimal action-value function Q∗(s, a) as the maximum of the expectation of the overall reward gathered according to the action-selection strategy π:

Q∗(st, at) = max_π E[Rt | st, at, π].

If we decide to choose the action with the maximal value, it is well known that such a problem statement boils down to the Bellman equation:

Q∗(s, a) = E_{st∼S}[ rt + γ max_{at} Q∗(st, at) | s, a ].

To find the optimal action-value function we use the Q-learning algorithm [15, 9] and approximate Q∗(s, a) with a parametrization θ given by the weights of a neural network: Q∗(s, a) ≈ Q(s, a; θ). The neural network learns by minimizing the loss function, e.g. the mean squared error

L_i(θ_i) = E_{s,a∼ρ(·)}[ (y_i − Q(s, a; θ_i))² ],

where ρ(·) is a probability distribution over a training batch of observations and actions, and y_i is the summarized reward predicted by the network for the next observation st+1:

y_i = E_{st∼S}[ rt + γ max_{at+1} Q(st+1, at+1; θ_{i−1}) | s, a ].

The described problem statement is model-free (in contrast to value iteration approaches [14]), since we do not learn the environment's state transition function explicitly, and off-policy, since we use a fixed greedy strategy at = arg max_a Q(s, a; θ).
It is known that neural networks are affected by correlations in the reinforcement data, and to decrease learning degradation we use the experience replay technique [9]. We store all steps mt = {(st, at, rt, st+1)} during Ne episodes in a memory D = {m1, m2, . . . }. The agent's life cycle consists of two phases: acting and learning. During the acting phase the agent only records the steps and updates the memory. During the learning phase it shuffles the memory and uniformly selects batches for learning until the required number of learning epochs is reached.
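A minimal sketch of the replay memory and of forming the Q-learning targets y_i from a sampled batch, assuming a Keras-style model with a predict method (the class and function names are ours; handling of terminal states is omitted for brevity):

```python
import random
import numpy as np

class ReplayMemory:
    """Stores transitions m_t = (s_t, a_t, r_t, s_{t+1}) collected while acting."""
    def __init__(self):
        self.steps = []

    def remember(self, s, a, r, s_next):
        self.steps.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.steps, min(batch_size, len(self.steps)))

def make_targets(model, batch, gamma):
    """Build (inputs, targets) for one training step of the Q-network."""
    states = np.array([s for s, _, _, _ in batch])
    next_states = np.array([s_next for _, _, _, s_next in batch])
    targets = model.predict(states, verbose=0)      # current Q(s, a; theta_i)
    next_q = model.predict(next_states, verbose=0)  # Q(s_{t+1}, a; theta_{i-1})
    for i, (_, a, r, _) in enumerate(batch):
        targets[i, a] = r + gamma * np.max(next_q[i])  # y_i
    return states, targets
```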

2.2 Environment reward policy


The algorithm of reward generation is the main dynamic part of the environment E = (G, M). The state of the environment represented by the map M does not change after the application of the agent's action, but the algorithm G gives different results depending on the current agent's position. To force the agent to reach the final position via the shortest path, the function G should use knowledge about the optimal path.
We consider the following implementation of the G algorithm:

G(s, g, t) =
    α_opt·r_t^opt + α_rat·r_t^rat + α_euq·r_t^euq,   if p_t ← 0,
    r^obs,                                           if p_t ← 1,
    r^tar,                                           if p_t = g,

where Σ_i α_i = 1. Here, to define the usual reward when the current position pt is unblocked, we use three parameters. The parameter r_t^opt = l_t − l_{t−1} defines the change of the optimal distance l to the final position, taking obstacles into account. The parameter r_t^rat = e^{−l_t/l_0} gives bigger rewards for cells that are close to the final position. The parameter r_t^euq = |p_t − g| − |p_{t−1} − g| plays the role of a regularizer, forcing the agent to move directly toward the final position without paying attention to obstacles. If an obstacle is located at the current position p_t, the agent receives the punishment r^obs. Finally, if the agent reaches the destination, it receives the main part of the overall reward, r^tar.
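A minimal sketch of this reward scheme, assuming a helper that supplies the obstacle-aware distance l_t to the goal (the function names are ours; the value α_euq = 0.1 merely follows from the constraint that the α_i sum to one):

```python
import math

def euclid(p, g):
    """Straight-line distance |p - g|."""
    return math.hypot(p[0] - g[0], p[1] - g[1])

def reward(M, p_prev, p, g, l_prev, l, l0,
           a_opt=0.8, a_rat=0.1, a_euq=0.1, r_obs=-4.0, r_tar=10.0):
    """Reward G for moving from p_prev to p; l_prev and l are the obstacle-aware
    distances to the goal g before and after the move, l0 is the initial one."""
    if p == g:
        return r_tar                                  # destination reached
    if M[p] == 1:
        return r_obs                                  # stepped into a blocked cell
    r_opt = l - l_prev                                # change of the optimal distance
    r_rat = math.exp(-l / l0)                         # larger closer to the goal
    r_euq = euclid(p, g) - euclid(p_prev, g)          # straight-line regularizer
    return a_opt * r_opt + a_rat * r_rat + a_euq * r_euq
```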

3 Neural architectures
In this article we use several deep neural implementations of the Q-network. All of them contain a dense output layer with a linear activation function. The dimension of the output is equal to the number of actions available to the agent (8 in our case); the linear activation function supports both positive and negative Q values. The first layers of the network are convolution layers (2 or 3 of them) with sigmoid or ReLU activation functions. The main purpose of using them is to construct invariant representations of patterns occurring in the agent's visible area. Convolution kernels have a size of 3 × 3 and are shifted with stride 1. After each convolution layer we use dropout for regularization and to avoid overfitting. The second type of inner layers is a fully connected network (from 3 to 5 layers) accepting the flattened output of the convolution layers. The dense layers use tanh or ReLU activation functions with 50 neurons per hidden layer. The full schema of the neural architecture is presented in Fig. 1.
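As an illustration, one such Q-network can be written down in Keras as follows; this is a sketch only, and the number of filters, the dropout rate and the exact layer counts are our assumptions (the paper varies between 2-3 convolutional and 3-5 dense layers):

```python
from tensorflow.keras import layers, models

def build_q_network(d=5, n_actions=8):
    """Q-network: 3x3/stride-1 convolutions with dropout over the (2d+1)x(2d+1)
    observation window, fully connected layers, and a linear output layer."""
    return models.Sequential([
        layers.Input(shape=(2 * d + 1, 2 * d + 1, 1)),
        layers.Conv2D(16, kernel_size=3, strides=1, activation="relu"),
        layers.Dropout(0.25),
        layers.Conv2D(16, kernel_size=3, strides=1, activation="relu"),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(50, activation="tanh"),
        layers.Dense(50, activation="tanh"),
        layers.Dense(50, activation="tanh"),
        layers.Dense(n_actions, activation="linear"),  # one Q value per action
    ])
```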

Figure 1: Full schema of the neural network architecture

The learning process uses common deep learning machinery: batches of samples, the ADADelta algorithm for gradient descent, mean squared error as the loss function, and several epochs (passes) over the training data. We fit the network every 20 episodes.
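Continuing the sketch above, the training setup could look as follows (run_episode, the batch size, the epoch count and the discount value are illustrative placeholders):

```python
model = build_q_network()
model.compile(optimizer="adadelta", loss="mse")  # ADADelta + mean squared error

memory = ReplayMemory()
for episode in range(3000):
    run_episode(model, memory)                   # acting phase: fill the memory
    if episode % 20 == 0 and memory.steps:       # fit the network every 20 episodes
        states, targets = make_targets(model, memory.sample(256), gamma=0.9)
        model.fit(states, targets, batch_size=32, epochs=5, verbose=0)
```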

4 Experiments
Experimental evaluation of the proposed learning method for path finding was based on the OpenAI Gym framework [1]¹. To quickly assess performance, we created a special set of small (10 × 10 and 20 × 20) maps with obstacles of a shape “inconvenient for A*”. We call an obstacle “inconvenient” if A* with a greedy Euclidean heuristic will most probably “get lost in the maze” when trying to build a path around it. In other words, such obstacles increase the search space size of “best-first” algorithms (see Fig. 2). For each map from the set, a task consisting of a pair of valid starting and final positions is generated. Each training episode for the agent includes one of the maps and one of the corresponding tasks. After training, the agent's performance was validated on a set of maps and tasks not seen by the agent.

¹ Source code of the algorithm can be found in the repository https://github.com/cog-isa/deep-path

Figure 2: An example of the map used in experiments

To balance exploitation and exploration we use an ε-greedy strategy: the agent chooses a random action with probability ε and the predicted (greedy) action with probability 1 − ε. We use an annealing strategy, gradually decreasing the value of ε during learning to reduce the degree of exploratory behavior.
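A minimal sketch of this ε-greedy selection with linear annealing (the initial and final values of ε and the annealing length are our assumptions):

```python
import random
import numpy as np

def choose_action(model, state, episode, n_actions=8,
                  eps_start=1.0, eps_end=0.1, anneal_episodes=1000):
    """Random action with probability eps, greedy action otherwise;
    eps is annealed linearly from eps_start to eps_end over anneal_episodes."""
    frac = min(1.0, episode / anneal_episodes)
    eps = eps_start + frac * (eps_end - eps_start)
    if random.random() < eps:
        return random.randrange(n_actions)
    q_values = model.predict(state[np.newaxis, ..., np.newaxis], verbose=0)
    return int(np.argmax(q_values[0]))
```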
In our experiments we train the agent over 3000 episodes, i.e. considered pairs of a map and a task (Nep ∈ [1000, 5000]). The limit of steps for the agent on the map was 100 (Na = 100), and we define the size of the observed area in such a way that the agent can view the whole map from any position (d = 20). The other parameter values were as follows:
• αopt = 0.8, αrat = 0.1 - parameters of the usual reward calculation,
• robs = −4, rtar = 10 - values of the obstacle and target rewards,
• Ne = 10 - length of the memory,
• γ = 4 - discount parameter of the overall reward.
To evaluate the agent’s performance we measured the following metrics during each episode:
• the ratio of the length of the path built by the agent to the optimal path length, computed as the ratio of the numbers of unique grid cells in the two paths (we will refer to it as Mp),
• the ratio of the summed reward during the episode to the optimal one (Mr), as sketched below.
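Both metrics can be computed directly from the recorded paths and rewards; a minimal sketch (the function names are ours):

```python
def metric_mp(agent_path, optimal_path):
    """Mp: ratio of unique cells in the agent's path to those in the optimal path."""
    return len(set(agent_path)) / len(set(optimal_path))

def metric_mr(episode_reward, optimal_reward):
    """Mr: ratio of the summed episode reward to the optimal one."""
    return episode_reward / optimal_reward
```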
Figures 3 and 4 present the results of some experiments. As one can see, the agent robustly learns to achieve the destination. The best result was achieved using the neural network with 2 convolutional and 5 fully connected layers. From these figures one can also see that the neural networks seem to be not fully converged, or tend to overfit on partial experience.


Figure 3: Examples of paths found by the agent during learning

Figure 4: Convergence of learning process: value of loss function is minimized (left part) and
paths become shorter

5 Summary
In this paper we presented and evaluated a new approach to neural network based path planning using modern deep learning techniques. The experiments show that although the results are promising, there is much space for improvement. The obtained results allow us to claim that although the NN cannot yet be seen as an alternative to well-established path finding techniques, the application of various deep neural network architectures to path finding tasks is a promising line of research.

Acknowledgments This work was supported by the Russian Science Foundation (Project
No. 16-11-00048).

References
[1] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,
and Wojciech Zaremba. OpenAI Gym, 2016.


[2] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. CoRR, abs/1606.03657, 2016.
[3] D. Harabor and A. Grastien. An optimal any-angle pathfinding algorithm. In ICAPS 2013, 2013.
[4] P.E. Hart, N.J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum
cost paths. IEEE transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.
[5] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke,
P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in
speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine,
29(6):82–97, Nov 2012.
[6] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A
survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[7] R. E. Korf. Linear-space best-first search. Artificial Intelligence, 62(1):41–78, 1993.
[8] M. Likhachev, G. J. Gordon, and S. Thrun. ARA*: Anytime A* with provable bounds on suboptimality. Advances in Neural Information Processing Systems, 2003.
[9] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan
Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR,
abs/1312.5602, 2013.
[10] A. Nash, K. Daniel, S. Koenig, and A. Felner. Theta*: Any-angle path planning on grids. In
Proceedings of the National Conference on Artificial Intelligence, page 1177, 2007.
[11] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driess-
che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Diele-
man, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine
Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with
deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[12] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An Introduction. The MIT
Press, London, 2012.
[13] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual
connections on learning. arXiv preprint arXiv:1602.07261, 2016.
[14] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value Iteration Networks.
arXiv, pages 1–14, feb 2016.
[15] Christopher J. C. H. Watkins and Peter Dayan. Q-Learning. Machine Learning, 8(3):279–292,
1992.
[16] P. Yap. Grid-based path-finding. In Proceedings of 15th Conference of the Canadian Society for
Computational Studies of Intelligence, pages 44–55, 2002.
[17] Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource
neural machine translation. CoRR, abs/1604.02201, 2016.
