
Proceedings of the 2003 IEEE/RSJ Intl. Conference on Intelligent Robots and Systems
Las Vegas, Nevada · October 2003

Reinforcement Learning on an Omnidirectional Mobile Robot


Roland Hafner                          Martin Riedmiller
Roland.Hafner@udo.edu                  Martin.Riedmiller@udo.edu

Informatik Lehrstuhl 1
Universität Dortmund
D-44221 Dortmund

Abstract— With this paper we describe a well suited, scalable problem for reinforcement learning approaches in the field of mobile robots. We show a suitable representation of the problem for a reinforcement learning approach and present our results with a model based standard algorithm. Two different approximators for the value function are used, a grid based approximator and a neural network based approximator.

I. INTRODUCTION

With this paper we describe a well suited, scalable problem for reinforcement learning approaches in the field of mobile robots. This learning task will serve us as a test bed for the existing algorithms and the development of new ones. In the following section the problem and the definition of the learning task with various variations are shown. After that we show that existing algorithms are capable of solving the problem. Two different approximators are used to learn the defined tasks with a reinforcement approach.

II. PROBLEM DESCRIPTION

A. The Omnidirectional Mobile Robot

The omnidirectional mobile robot and the assembly we use for this paper was developed in [1]. Figure 1 shows a schematic view of the omnidirectional mobile robot platform. It consists of three omnidirectional or mecanum wheels, each driven by a motor with attached gear and encoder.

Fig. 1. A schematic view of our omnidirectional mobile robot. Figure a) shows the mechanical assembly of the robot platform. In figure b) the side parts are removed for a better view.

In contrast to differential drive mobile robots, omnidirectional mobile robots are capable of driving in any direction, not depending on their actual heading. This may be a benefit for a mobile robot, especially because it enables very flexible and fast movement of the robot without the need of maneuvering.

With this mobile robot we want to participate in the RoboCup [8] middle size league, where we are already participating in the simulation league. In the simulation league we are successfully using various machine learning techniques [7][4][6]. Reinforcement learning and the idea of self learning systems is well known in the fields of robotics [9] and in the RoboCup [2].

In figure 2 the kinematics of our robot is shown. The wheels are placed in the so called Δ-array. This implies that the main axes of the wheels are pairwise placed at an angle of 120°. Each wheel has the same distance to the center of mass of the mobile robot, which is approximately located in the center of the platform. To make the kinematic model more flexible against errors in the assembly we extended it by some parameters. As can be seen in figure 2, wheels 1 and 2 are supposed to have the same distance L2 to the center of mass of the mobile robot, while the distance L1 for wheel 3 can differ. In addition a parameter δ is introduced that covers misalignments of the axes of the wheels. This model is adopted from [3], where it is used to design a mobile robot for the small size league. The resulting kinematics is shown in (1).

Fig. 2. The kinematic model of our omnidirectional mobile robot.
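Equation (1) is not reproduced here; as a rough illustration, the following sketch shows the standard inverse kinematics of such a three-wheel platform under idealized assumptions of our own (equal wheel distance L, no axis misalignment, mounting angles chosen by us). The exact parametrized form with L1, L2 and δ follows the design in [3].

import math

# Illustrative inverse kinematics of an ideal three-wheel omnidirectional
# robot: wheels mounted at 120 degree spacing, all at distance L from the
# center of mass. Values and angles below are assumptions, not taken from (1).
WHEEL_ANGLES = [math.radians(a) for a in (90.0, 210.0, 330.0)]  # assumed mounting angles
L = 0.18  # assumed wheel-to-center distance in metres

def wheel_speeds(vx, vy, omega):
    """Map a desired body velocity (vx, vy in m/s, omega in rad/s), given in
    the robot frame, to the three wheel surface speeds in m/s.
    Dividing by the wheel radius would give the angular wheel speeds."""
    return [-math.sin(a) * vx + math.cos(a) * vy + L * omega
            for a in WHEEL_ANGLES]

# Example: a pure rotation drives all three wheels at the same speed.
print(wheel_speeds(0.0, 0.0, 1.0))  # -> [0.18, 0.18, 0.18]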

B. The Learning Task

The main idea is that the robot learns to drive from position A (covering the position and orientation of the robot) to position B by appropriate influence on the speed of each wheel. To make this definition of the learning task more precise and to allow a flexible and universal representation we introduced a construct we named "Target Frame" (TF).

1) The Target Frame O_TF.: With the Target Frame O_TF we set up a general definition for our learning task which is scalable and flexible.



For the learning task the robot determines its position R_TF in the frame O_TF. As shown in (2), this position R_TF consists of a vector of coordinates that represents the distance from the target position.

R_TF = (x_TF, y_TF, θ_TF)^T    (2)

What the robot has to learn is now reduced to the task of driving from any position in the target frame to its origin, obtaining an orientation of zero with respect to this frame. This is equivalent to minimizing the distance from the robot's position to the target position in any coordinate system. To make use of this representation in real world problems, all we have to do is to place the target frame at the desired position and orientation.
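The target frame construction can be made concrete with a minimal sketch of our own (the function name and interface are not from the paper): given the robot pose and the pose of the target frame in a common world frame, R_TF is obtained by an ordinary planar change of coordinates.

import math

def pose_in_target_frame(robot_pose, target_pose):
    """Express the robot pose in the target frame O_TF.

    robot_pose, target_pose: (x, y, theta) in the same world frame.
    Returns R_TF = (x_TF, y_TF, theta_TF); driving this vector to zero
    means reaching the target position with the target orientation.
    """
    xr, yr, thr = robot_pose
    xt, yt, tht = target_pose
    dx, dy = xr - xt, yr - yt
    # rotate the world-frame offset into the target frame
    x_tf = math.cos(tht) * dx + math.sin(tht) * dy
    y_tf = -math.sin(tht) * dx + math.cos(tht) * dy
    # orientation difference, wrapped to [-pi, pi]
    th_tf = math.atan2(math.sin(thr - tht), math.cos(thr - tht))
    return (x_tf, y_tf, th_tf)

Placing the target frame behind the ball and oriented towards the goal, as in the RoboCup example of figure 3 below, only changes target_pose; the learning task itself stays the same.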
This representation allows us to specify various learning tasks. Later we will add some dimensions to get an exact representation of the robot's behavior. The main advantage is that we can code some task specific constraints, e.g. obstacles, in this target frame.

As an example, figure 3 shows an application of this representation in a RoboCup scenario. In figure 3a the robot is supposed to determine its own position, the position of the ball and the position of the goal in some coordinate frame. The desired position is said to be located behind the ball with orientation towards the goal. The origin of the target frame and the orientation of its axes shown in figure 3b is thereby determined. A robot that is capable of driving to the origin of the target frame and of avoiding collisions with the ball is also capable of solving this problem.

Fig. 3. In figure a) the robot can determine its own position and the target position to reach with reference to the world frame. In this example the target position is shown as the position behind the ball with the orientation towards the goal. In figure b) the same situation is shown, but the robot determines its pose with reference to the TF.

2) The Task in Detail.:
Task 1: the robot should learn how to drive to the target position.
Task 2: the robot should learn how to drive to the target position and avoid collisions with the ball placed at a fixed position in the target frame.
While solving the task there should be an optimization with respect to time, resp. the number of steps for a solution.

III. REPRESENTATION OF THE LEARNING TASKS

To set up a reinforcement learning (RL) problem the corresponding Markov Decision Problem (MDP) has to be defined.

The framework we consider here is a standard Markov Decision Process (MDP). An MDP is a 4-tuple (S, A, T, r), where S denotes a finite set of states, A denotes the action space, and T is a probabilistic transition function T : S × A × S → [0, 1] that denotes the probability for a transition from state s to state s' when a certain action a is applied. Finally, r : S × A → R is a reward function that denotes the immediate reward for applying a certain action a in a certain state s. In this paper we consider only deterministic MDPs.

A. Basic Learning Algorithm

As a basic learning algorithm we choose a model based approach published in [10]. This basic algorithm is a trajectory and model based approach and complies with the following pseudo code.

The basic learning algorithm:

Init system with initial state s_0 ∈ S_0
While Not (stop condition) {
  Choose action
    a_t = argmin_{a ∈ A(s_t)} ( r(s_t, a) + J_k(f(s_t, a)) )    (3)
  Apply
    s_{t+1} = f(s_t, a_t)    (4)
  Adapt
    J_{k+1}(s_t) = r(s_t, a_t) + J_k(s_{t+1})    (5)
}
A learning agent interacts with the system, resp. the world or the omnidirectional robot, in a sequence of discrete time steps t = 0, 1, 2, .... In each time step the agent gets a description of the actual state of the system s_t ∈ S, where S is the set of (possible) states of the system.

The learning agent runs on trajectories through the state space. These trajectories start in a set of initial startup positions. In every step the agent selects an action a ∈ A(s_t) according to the actual value function J as shown in (3) and applies this action to the system. The system is then in a new state s_{t+1}, following the model of the system f(s, a). Based on the experience from the reaction of the system the agent adapts the value function and tries to get a better approximation of the optimal value function as shown in (5).

The stop condition is fulfilled if the robot runs into a not allowed state from a set of states S- or a maximum time horizon for a trajectory is reached.
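Putting (3)-(5) and the stop condition together, a minimal sketch in Python reads as follows. This is our paraphrase of the pseudo code above; the interfaces of the model f, the cost function r, the action set A(s) and the value function approximator J are assumptions.

def run_trajectory(s0, f, r, actions, J, in_S_minus, in_S_plus, t_max, learn=True):
    """One trajectory of the model based algorithm (3)-(5).

    f(s, a)    -> successor state (deterministic system model)
    r(s, a)    -> immediate cost
    actions(s) -> the action set A(s)
    J          -> value function approximator: J(s) evaluates, J.update(s, target) adapts
    Stop condition as in the text: a state in S- or the time horizon t_max;
    reaching the target set S+ is assumed here to end the trajectory as well.
    With learn=False the 'Adapt' step is skipped, as in test mode.
    """
    s, t = s0, 0
    while not (in_S_minus(s) or in_S_plus(s) or t >= t_max):
        # (3) greedy one-step lookahead on the current value function
        a = min(actions(s), key=lambda u: r(s, u) + J(f(s, u)))
        s_next = f(s, a)                       # (4) apply the chosen action
        if learn:
            J.update(s, r(s, a) + J(s_next))   # (5) adapt the value function
        s, t = s_next, t + 1
    return s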
B. Representation of the considered Problem

To represent the above described problem as an MDP we use the target frame concept. Thus in each time step the robot has a state s_t as defined in (6).

s_t = R_TF^t = (x_TF^t, y_TF^t, θ_TF^t)^T    (6)

We restricted this state space as shown in (7) to a working space S^W. This means the robot learns to drive from any orientation to a target position that is approximately at 1 m distance. This is a 3 dimensional continuous state space.

S^W = { s | x_TF, y_TF ∈ [-1, 1], θ_TF ∈ [-π, π] }    (7)

To define a set of actions that allows the learning agent to interact with the system we defined 27 primitive actions. With these primitive actions the agent is able to set the wheel speed for each wheel. Equation (8) shows the set of primitive actions used here. The parameter v was set to 10 rad/s to achieve a smooth movement of the omnidirectional robot.
We want the agent to get the robot into the target position. This target position is defined as a set S+ of positions within a small distance of the origin of the target frame. Equation (9) shows the definition we used for S+.

S+ = { s | x_TF, y_TF ∈ [-0.05, 0.05], θ_TF ∈ [-0.05, 0.05] }    (9)

In the same way we defined the set of not allowed states S-. This set depends on the task we want to learn. In the first task, where we don't consider any obstacles, we defined it as shown in (10). The states not allowed are only those too far away from the target and thus not in the work space.

S- = { s | s ∉ S^W }    (10)

In the second task we consider the ball as an obstacle. Therefore we have to represent this obstacle in the set S- as shown in (11). The ball as an obstacle is represented as a fixed obstacle region in the working space. For the moment we extended the ball with a convex hull of the robot.

S- = { s | s ∉ S^W ∨ s ∈ Ball }    (11)

Using the above given definitions we can define the real learning target using the cost function r(s, a) as shown in (12).

r(s, a) = r(s) :=  r-,       s ∈ S-
                   r+,       s ∈ S+
                   r_trans,  else        (12)
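Definitions (9)-(12) translate directly into predicates and a direct cost function. The following sketch is our illustration; the values of r-, r+ and r_trans are left as parameters since they are not given in the text, and the ball obstacle is an abstract predicate.

import math

def in_S_plus(s):
    """Target set S+ of (9): close to the origin of the target frame."""
    x, y, th = s
    return abs(x) <= 0.05 and abs(y) <= 0.05 and abs(th) <= 0.05

def in_workspace(s):
    """Working space S^W of (7)."""
    x, y, th = s
    return -1.0 <= x <= 1.0 and -1.0 <= y <= 1.0 and -math.pi <= th <= math.pi

def in_S_minus(s, in_ball=lambda s: False):
    """S- of (10)/(11): outside the work space or, for task 2, inside the
    ball obstacle region; in_ball is the obstacle predicate."""
    return (not in_workspace(s)) or in_ball(s)

def cost(s, r_minus, r_plus, r_trans, in_ball=lambda s: False):
    """Direct cost function r(s) of (12); r-, r+ and r_trans are parameters."""
    if in_S_minus(s, in_ball):
        return r_minus
    if in_S_plus(s):
        return r_plus
    return r_trans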

C. Approximator

We used two function approximators in this work. The first one is an implementation of an occupancy grid based method used in [5]. The second approximator we used is an artificial neural network. We used only nets with one hidden layer and 15 units.
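As a rough stand-in for the grid based approximator (the actual method follows [5] and is not reproduced here), the sketch below simply stores one value per node of a regular grid over the work space and exposes the J(s)/J.update(s, target) interface used in the learning-loop sketch above. The 51×51×19 resolution is taken from the parameters reported in Section IV; everything else is an assumption.

import numpy as np

class GridValueFunction:
    """Minimal nearest-node lookup table over a regular grid covering S^W."""

    def __init__(self, nodes=(51, 51, 19),
                 lo=(-1.0, -1.0, -np.pi), hi=(1.0, 1.0, np.pi),
                 init_value=0.0):
        self.lo, self.hi = np.array(lo), np.array(hi)
        self.nodes = np.array(nodes)
        self.table = np.full(tuple(nodes), float(init_value))

    def _index(self, s):
        # map a continuous state to the index of the nearest grid node
        rel = (np.asarray(s) - self.lo) / (self.hi - self.lo)
        idx = np.rint(rel * (self.nodes - 1)).astype(int)
        return tuple(np.clip(idx, 0, self.nodes - 1))

    def __call__(self, s):
        return self.table[self._index(s)]

    def update(self, s, target):
        self.table[self._index(s)] = target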
IV. RESULTS

It is possible to learn the task with the mentioned algorithm and both approximators with a simulated robot.

We chose two sets S_0 of startup positions. The first is a train set that contains 125 startup positions on a 5×5×5 grid in the work space S^W. These startup positions are starting points for trajectories of the learning algorithm. The second is a test set that also contains 125 startup positions, but randomly distributed in the whole work space S^W. These startup positions are starting points for test trajectories with the actual learned value function.

To measure the quality of the learned function we chose the average accumulated success and costs of all trajectories per episode. In an episode the algorithm starts in each point of the startup set. When in test mode the "Adapt" step of the algorithm is not performed.

In figure 4 and figure 5 typical diagrams are shown for learning task 1 using a neural network resp. the grid based approximator. As you can see, to learn task 1 both approximators need approximately 25000 learning runs.

Fig. 4. Learning task 1: a) average accumulated success, b) average accumulated costs.

Figure 6 shows an example of a learned value function for the second task, represented by the grid based approximator. You can see a slice through the value function with the orientation set to 0. At the target position a sinkhole has established itself. The region behind this sinkhole is the obstacle region of the ball. The algorithm won't visit this region and leaves it unchanged at its standard values, because all positions there are in S- and this representation of the robot allows it to stop in "no time", so that it won't run into this region even if it runs toward it with high speed.

Fig. 6. Example of a learned grid approximation of a value function that solves the second task.

In figure 7 a neural net approximation of the value function for the same task is shown. The neural net generalized over the obstacle region, even though it is not visited by the algorithm.

Fig. 7. Example of a learned neural net approximation of a value function that solves the second task.

The results presented in this paper depend on the parameters of the used approximators. The parameters of the grid based function approximator are primarily the number and shapes of the lattices that cover the work space. In this example these lattices are built from 51 nodes for the x- and y-coordinates resp. 19 nodes for the heading. The neural network consists of 15 hidden units (3 input and 1 output unit), using the standard back propagation algorithm with a learn rate of 0.05.
position a sinkhole has established. The region behind
this sinkhole is the obstacle region of the ball. The In figure 8 three learned trajectories of the robot are
algorithm won't visit this region and leaves it unchanged shown for the first learning task. Figure Sa shows the
on its standard d u e s because all positions there are in result with the grid based approximator. The trajectory
S- and this representation of the robot allows it to stop of the robot is shown as a track of black dots. The
in "no time" so that it won't run in this region even it orientation of the robot at each point of the trajectory is
runs toward it with high speed. marked as thin lines pointing in the robots heading.

The robot learned to drive to the origin of the target frame by using a very limited set of primitive actions.

Figure 9 shows the results for the second learning task. The robot learned to drive from any startup position to the target position while avoiding the obstacle region. Three resulting trajectories are shown in this figure, where a) are the results from the grid based approximator and b) are the results of the neural network approximator. As one can see, the trajectories generated by the neural network are smoother, while the resulting trajectories from the grid based approximator are a little shorter.

Fig. 9. Results for the second learning task. a) Grid based approximation. b) Neural network approximation. The obstacle region is shown as a circle.

The results for the second learning task promise good results for the application in the field of RoboCup soccer playing robots. For these it is very important to drive behind the ball, that is, to drive to a desired position relative to the ball and the goal without crashing into the ball. This would eventually cause a bad result by putting the ball in a direction near the own goal. The robot learned to prevent this and will perform well in this task.

A problem is that the orientation of the robot should be towards the ball, maybe because the robot has a preferred side for ball handling. It would be more optimal to have the robot face the ball with exactly this side. To achieve this behavior we changed the definition of the direct cost function from (12). To define these new direct costs we defined a cost function r'(s, a) = r'(s) that depends on the difference between the actual robot heading and the direction in which the ball is seen from the robot. This difference is normalized to [0 : r_trans]. The modified direct costs are defined in (13).

r(s, a) = r(s) :=  r-,               s ∈ S-
                   r+,               s ∈ S+
                   r_trans + r'(s),  else        (13)
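A minimal sketch of the modified costs (13), under our reading that r'(s) grows linearly with the absolute heading error towards the ball (placed at a fixed position in the target frame) and is scaled to [0, r_trans]:

import math

def heading_penalty(s, ball_xy, r_trans):
    """r'(s): difference between the robot heading and the direction in which
    the ball is seen from the robot, normalized to [0, r_trans]."""
    x, y, th = s
    bearing = math.atan2(ball_xy[1] - y, ball_xy[0] - x)   # ball direction in the TF
    diff = abs(math.atan2(math.sin(bearing - th), math.cos(bearing - th)))
    return r_trans * diff / math.pi

def modified_cost(s, r_minus, r_plus, r_trans, ball_xy, in_S_minus, in_S_plus):
    """Modified direct cost function of (13); the set predicates are passed in."""
    if in_S_minus(s):
        return r_minus
    if in_S_plus(s):
        return r_plus
    return r_trans + heading_penalty(s, ball_xy, r_trans)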
The resulting trajectories with this modified direct cost function are shown in figure 10. In figure 10a the resulting trajectories from the grid based approximator are shown. The robot learned to keep its heading towards the ball along the whole trajectory.

Fig. 10. The resulting trajectories with the modified direct cost function. a) The grid based approximator based agent has a good orientation towards the ball. b) The results for the neural network based approximator.

As shown in figure 10b, the neural network based approximator is not able to learn the modified value function with the standard parameters used in our recent work.

V. CONCLUSION

We showed the ability of a standard reinforcement learning algorithm to solve the defined tasks.

The learned agent is able to control an omnidirectional mobile robot by setting the speed of its three wheels to drive to a desired position. By defining an obstacle region in the presented representation of this learning task, the robot is able to avoid collisions with a static obstacle.

The final goal is to get the learning algorithm onto the real robot. To achieve this, the next step is to expand the representation to a real dynamic model of the robot. For this we have to consider the actual wheel speeds in the state space. The actions for the learning agent are then accelerations. First results in this work are very promising.

Further work will be done on Q-learning for this task. To reach the final goal of learning on the real hardware a lot of work has to be performed, especially on speeding up the learning process. The recent algorithm needs, in its standard form, approximately 25000 runs in simulation. A run lasts for 5 seconds, so this would be approx. 35 hours. This time is too high for learning on a real robot because the hardware is not robust enough.

VI. REFERENCES

[1] Hafner, R.: Reinforcement Lernen auf einem mobilen Roboter. Master Thesis, University of Karlsruhe, Germany (2002).
[2] A. Kleiner, M. Dietl and B. Nebel. Towards a Life-Long Learning Soccer Agent. Proceedings of the International RoboCup Symposium '02. Fukuoka, Japan (2002).
[3] Subir Kumar Saha, Jorge Angeles and John Darcovich. The Design of Kinematically Isotropic Rolling Robots with Omnidirectional Wheels. Mechanism and Machine Theory, 1995, vol. 30, pages 1127-1137.
[4] A. Merke and M. Riedmiller. Karlsruhe Brainstormers - a reinforcement learning way to robotic soccer II. In A. Birk, S. Coradeschi, and S. Tadokoro, editors, RoboCup-2001: Robot Soccer World Cup V, LNCS, pages 322-327. Springer (2001).
[5] Merke, A. and Schoknecht, R.: A Necessary Condition of Convergence in Reinforcement Learning with Function Approximation. In International Conference on Machine Learning (ICML 2002). Morgan Kaufmann (2002).
[6] M. Riedmiller and A. Merke. Using machine learning techniques in complex multi-agent domains. In I. Stamatescu, W. Menzel, and U. Ratsch, editors, Perspectives on Adaptivity and Learning. Springer (2002).
[7] M. Riedmiller, A. Merke, D. Meier, A. Hoffmann, A. Sinner, O. Thate, C. Kill, and R. Ehrmann. Karlsruhe Brainstormers - a reinforcement learning way to robotic soccer. In A. Jennings and P. Stone, editors, RoboCup-2000: Robot Soccer World Cup IV, LNCS. Springer (2000).
[8] RoboCup 2001: Robot Soccer World Cup V. Springer (2002).
[9] C. K. Tham and R. W. Prager. Reinforcement Learning for Multi-Linked Manipulator Obstacle Avoidance and Control. Cambridge University, 1993.
[10] J. Weisbrod, M. Riedmiller and M. Spott. Fynesse: A hybrid architecture for self-learning control. In I. Cloete and J.M. Zurada, editors, Knowledge Based Neuro Computing. MIT Press, 1999.
