Informatik Lehrstuhl 1, Universität Dortmund, D-44221 Dortmund
…with respect to time, i.e., the number of steps needed for a solution.
III. REPRESENTATION OF THE LEARNING TASKS
To set up a reinforcement learning (RL) problem, the corresponding Markov Decision Process (MDP) has to be defined.
Fig. 3. In figure a) the robot determines its own position and the target position to reach with reference to the world frame. In this example the target position is the position behind the ball with the orientation towards the goal. In figure b) the same situation is shown, but the robot determines its pose with reference to the target frame (TF).
…a ∈ A(s_t) according to the actual value function J, as shown in (3), and applies this action to the system. The system is then in a new state s_{t+1}, following the model of the system f(s, a). Based on the experience from the reaction of the system, the agent adapts the value function and tries to get a better approximation of the optimal value function, as shown in (5).

The stop condition is fulfilled if the robot runs into a not-allowed state from a set of states S− or if a maximum time horizon for a trajectory is reached.

B. Representation of the Considered Problem

To represent the problem described above as an MDP we use the target frame concept. Thus in each time step the robot has a state s_t as defined in (6).

s_t = (x_TF, y_TF, θ_TF)^T    (6)

We restricted this state space, as shown in (7), to a working space S^W. This means the robot learns to drive from any orientation to a target position that is approximately within 1 m distance. This is a 3-dimensional continuous state space.

We want the agent to get the robot to the target position. This target position is defined as a set S+ of positions within a small distance of the origin of the target frame. Equation (9) shows the definition we used for S+.

S+ = {s | x_TF ∈ [−0.05, 0.05], y_TF ∈ [−0.05, 0.05]}    (9)

In the same way we defined the set of not-allowed states S−. This set depends on the task we want to learn. In the first task, where we do not consider any obstacles, we defined it as shown in (10): the only not-allowed states are those too far away from the target and thus outside the working space.

In the second task we consider the ball as an obstacle. Therefore we have to represent this obstacle in the set S−, as shown in (11). The ball is represented as a fixed obstacle region in the working space; for the moment we extend the ball region by the convex hull of the robot.
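A hedged illustration of this representation follows. The ±0.05 bounds come from (9) and the roughly 1 m working space from the text; the exact forms of (7), (10) and (11) are not reproduced in this excerpt, so the ball-region geometry and all other concrete numbers are assumptions:

import math

BALL_CENTER = (0.0, -0.25)  # ball region position in the target frame; assumed
BALL_RADIUS = 0.20          # ball radius inflated by the robot's convex hull; assumed

def in_workspace(s):
    """Working space S^W: roughly within 1 m of the target origin (assumed form of (7))."""
    x, y, theta = s
    return math.hypot(x, y) <= 1.0

def in_target_set(s):
    """S+ per (9): x_TF and y_TF within +/-0.05 of the target-frame origin."""
    x, y, theta = s
    return -0.05 <= x <= 0.05 and -0.05 <= y <= 0.05

def in_forbidden_set(s, with_ball=False):
    """S-: outside the working space (task 1, cf. (10)); additionally the
    inflated ball region for task 2 (cf. (11))."""
    x, y, theta = s
    if not in_workspace(s):
        return True
    if with_ball:
        bx, by = BALL_CENTER
        return math.hypot(x - bx, y - by) <= BALL_RADIUS
    return False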
C. Approximator
We used two function approximators in this work. The first one is an implementation of the occupancy-grid-based method used in [5].
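Since the excerpt only cites [5] for this method, the following is a generic sketch of an occupancy-grid-style value-function approximator over the 3-dimensional state space; resolution, bounds and update rule are assumptions:

import numpy as np

class GridValueFunction:
    """Piecewise-constant value function on a regular 3-D grid over S^W.
    Resolution and bounds are assumptions, not the settings of [5]."""

    def __init__(self, bins=(21, 21, 21),
                 lo=(-1.0, -1.0, -np.pi), hi=(1.0, 1.0, np.pi)):
        self.bins = np.array(bins)
        self.lo = np.array(lo)
        self.hi = np.array(hi)
        self.values = np.zeros(bins)

    def _index(self, s):
        # Map a continuous state to the index of its grid cell.
        ratio = (np.asarray(s) - self.lo) / (self.hi - self.lo)
        idx = np.clip((ratio * self.bins).astype(int), 0, self.bins - 1)
        return tuple(idx)

    def __call__(self, s):
        return self.values[self._index(s)]

    def update(self, s, target, alpha=0.5):
        # Move the cell value toward the new one-step estimate (cf. (5)).
        i = self._index(s)
        self.values[i] += alpha * (target - self.values[i])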
The second approximator we used is an artificial neural network. We used only nets with one hidden layer of 15 units.
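A minimal sketch of such a value-function network follows; the single hidden layer with 15 units matches the text, while the tanh activation, initialization and gradient update rule are assumptions:

import numpy as np

class MLPValueFunction:
    """Value-function approximator with one hidden layer of 15 tanh units."""

    def __init__(self, n_in=3, n_hidden=15, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, n_hidden)
        self.b2 = 0.0

    def __call__(self, s):
        h = np.tanh(self.W1 @ np.asarray(s) + self.b1)
        return self.W2 @ h + self.b2

    def update(self, s, target, lr=0.01):
        # One gradient step on the squared error toward the value target.
        s = np.asarray(s)
        h = np.tanh(self.W1 @ s + self.b1)
        err = (self.W2 @ h + self.b2) - target
        grad_h = err * self.W2 * (1 - h**2)   # backprop through tanh
        self.W2 -= lr * err * h
        self.b2 -= lr * err
        self.W1 -= lr * np.outer(grad_h, s)
        self.b1 -= lr * grad_h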
IV. RESULTS
The task can be learned with the described algorithm and both approximators on a simulated robot.
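Tying the pieces together, the interaction loop described in Section III (greedy action choice with respect to J, a step of the model f, a value backup, and the stop condition on S+, S− or the time horizon) might look like the following outline. Equations (3) and (5) are not reproduced in this excerpt, so the cost term, discounting and backup rule shown here are assumptions:

def run_episode(mdp, J, s0, t_max=200, gamma=0.95):
    """One learning trajectory: act greedily w.r.t. the current value
    function J, then update J from the observed transition."""
    s, t = s0, 0
    while not (mdp.is_target(s) or mdp.is_forbidden(s) or t >= t_max):
        # Greedy action: minimize direct cost plus discounted successor value (cf. (3)).
        a = min(mdp.actions(s),
                key=lambda act: mdp.cost(s, act) + gamma * J(mdp.model(s, act)))
        s_next = mdp.model(s, a)
        # Value backup toward the one-step estimate (cf. (5)).
        J.update(s, mdp.cost(s, a) + gamma * J(s_next))
        s, t = s_next, t + 1
    return s

Here J can be either of the two approximators above, since both expose a call for evaluation and an update method.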
We chose two sets of startup positions. The first is a training set that contains 125 startup positions on a 5×5×5 grid in the working space S^W. These startup positions are the starting points for the trajectories of the learning algorithm. The second is a test set that also contains 125 startup positions, but randomly distributed over the whole working space S^W. These startup positions are the starting points for test trajectories driven with the currently learned value function.
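A sketch of how these two startup sets could be generated; only the 5×5×5 layout and the set sizes of 125 come from the text, while the grid extents and sampling bounds are assumptions:

import numpy as np

def training_startups():
    """125 startup poses on a regular 5x5x5 grid over S^W (extents assumed)."""
    xs = np.linspace(-0.9, 0.9, 5)
    ys = np.linspace(-0.9, 0.9, 5)
    thetas = np.linspace(-np.pi, np.pi, 5)
    return [(x, y, th) for x in xs for y in ys for th in thetas]

def test_startups(n=125, seed=1):
    """125 startup poses drawn uniformly from the whole working space S^W."""
    rng = np.random.default_rng(seed)
    poses = []
    while len(poses) < n:
        x, y = rng.uniform(-1.0, 1.0, 2)
        if np.hypot(x, y) <= 1.0:   # stay inside the 1 m working space
            poses.append((x, y, rng.uniform(-np.pi, np.pi)))
    return poses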
Fig. 9. Results for the second learning task. a) Grid-based approximation. b) Neural network approximation. The obstacle region is shown as a circle.
Fig. 7. Example of a learned neural net value function that solves the second task.
Fig. 10. The resulting trajectories with the modified direct cost function. a) The grid-based approximator agent shows a good orientation towards the ball. b) The results for the neural network based approximator.
…agent is able to control an omnidirectional mobile robot by setting the speeds of its three wheels to drive to a desired position. By defining an obstacle region in the presented representation of this learning task, the robot is able to avoid collisions with a static obstacle.

The final goal is to get the learning algorithm onto the real robot. To achieve this, the next step is to expand the representation to a real dynamic model of the robot. For this we have to consider the actual wheel speeds in the state space; the actions for the learning agent are then accelerations. First results in this work are very promising.
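The planned extension can be sketched as follows. Only the idea of adding the three wheel speeds to the state and using accelerations as actions comes from the text; the wheel layout, time step, speed limits and the standard inverse-kinematics mapping are assumptions:

import numpy as np

DT = 0.05          # control cycle in seconds; assumed
MAX_SPEED = 2.0    # wheel speed limit; assumed

def step_dynamic_state(state, accelerations):
    """Extended state (x, y, theta, w1, w2, w3): pose plus the three wheel
    speeds. Actions are wheel accelerations, integrated over one time step."""
    x, y, theta, w = state[0], state[1], state[2], np.asarray(state[3:])
    w = np.clip(w + DT * np.asarray(accelerations), -MAX_SPEED, MAX_SPEED)
    # Kinematics of a three-wheel omnidirectional base (wheels at 120 degrees);
    # this is a standard textbook mapping, not taken from the paper.
    angles = np.radians([0.0, 120.0, 240.0])
    J = np.stack([-np.sin(angles), np.cos(angles), np.full(3, 0.2)], axis=1)
    vx, vy, omega = np.linalg.solve(J, w)   # body-frame velocities from wheel speeds
    x += DT * (vx * np.cos(theta) - vy * np.sin(theta))
    y += DT * (vx * np.sin(theta) + vy * np.cos(theta))
    theta += DT * omega
    return (x, y, theta, *w)

This turns the 3-dimensional kinematic state space into a 6-dimensional one, which is the representational cost of modeling the robot's dynamics.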
VI. REFERENCES
[7] RoboCup-2000: Robot Soccer World Cup IV, LNCS. Springer (2000).
[8] RoboCup 2001: Robot Soccer World Cup V, LNCS. Springer (2002).
[9] C. K. Tham and R. W. Prager. Reinforcement Learning for Multi-Linked Manipulator Obstacle Avoidance and Control. Cambridge University, 1993.
[10] J. Weisbrod, M. Riedmiller, and M. Spott. Fynesse: A hybrid architecture for self-learning control. In I. Cloete and J. M. Zurada, editors, Knowledge-Based Neurocomputing. MIT Press, 1999.