www.elsevier.com/locate/engappai
Abstract
This paper presents a machine-learning approach to the multi-robot coordination problem in an unknown dynamic environment.
A multi-robot object transportation task is employed as the platform to assess and validate this approach. Specifically, a flexible two-
layer multi-agent architecture is developed to implement multi-robot coordination. In this architecture, four software agents form a high-
level coordination subsystem while two heterogeneous robots constitute the low-level control subsystem. Two types of machine
learning—reinforcement learning (RL) and genetic algorithms (GAs)—are integrated to make decisions when the robots cooperatively
transport an object to a goal location while avoiding obstacles. A probabilistic arbitrator is used to determine the winning output
between the RL and GA algorithms. In particular, a modified RL algorithm called the sequential Q-learning algorithm is developed to
deal with the issues of behavior conflict that arise in multi-robot cooperative transportation tasks. The learning-based high-level
coordination subsystem sends commands to the low-level control subsystem, which is implemented with a hybrid force/position control
scheme. Simulation and experimental results are presented to demonstrate the effectiveness and adaptivity of the developed approach.
© 2007 Elsevier Ltd. All rights reserved.
Keywords: Multi-robot systems; Cooperative control; Q-learning; Genetic algorithms; Intelligent transportation; Multi-agent systems; Autonomous
robots
0952-1976/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.
doi:10.1016/j.engappai.2007.05.006
Y. Wang, C.W. de Silva / Engineering Applications of Artificial Intelligence 21 (2008) 470–484 471
and in Yamada's work (Yamada and Saito, 2001), where they developed a multi-robot box-pushing system based on a reactive behavior architecture.

Huntsberger et al. (2003) proposed a behavior-based multi-robot architecture, CAMPOUT, for the multi-robot work crew (RWC) project developed at JPL of NASA (Schenker et al., 2003). CAMPOUT is a fully distributed, behavior-based multi-robot architecture. It employs the leader–follower decentralized control strategy. CAMPOUT was validated in a multi-robot cooperative transportation task in a PV tent autonomous deployment project on a planetary surface and a cliff robot project. In their latest work, Stroupe et al. (2005) presented a multi-robot autonomous construction task based on the CAMPOUT architecture.

Wang (Wang et al., 2004, 2005) and Kumar (Pereira et al., 2004; Sugar et al., 2001; Sugar and Kumar, 2002) studied the caging formation control problem in multi-robot cooperative transportation. They proposed the "object closure" strategy to move the object with multiple mobile robots while maintaining the formation.

A research field related to multi-robot object transportation is multi-robot manipulation, where multiple industrial robotic manipulators cooperatively carry an object while minimizing the internal forces generated among the robots. Much work has been done in this field. For example, Kumar and Garg (2004) designed a fuzzy logic controller to control internal forces and moments while two industrial manipulators cooperatively carry an object.

Another interesting topic in the field of multi-agent robotics is multi-robot learning (Stone and Veloso, 2000; Yang and Gu, 2004). The learning capability is desirable for a multi-robot system. It helps the robots to cope with a dynamic or unknown environment, find the optimal cooperation strategy, and make the entire system increasingly flexible and autonomous. Although most of the existing commercial multi-robot systems are controlled remotely by a human, autonomous operation is the desirable objective for the next-generation systems. Without the learning capability, it is virtually impossible for a robotic system to become fully autonomous. Therefore, the introduction of machine-learning technologies into a multi-robot system has become a necessity.

Stone and Veloso (1998) suggested a layered approach with machine learning for multi-agent cooperation. In their paper, neural networks were used to learn the low-level skills of a robot, and the decision-tree method was employed to learn the high-level strategies. However, their work focused on the specific application of robotic soccer. Inoue (Inoue et al., 2004) suggested employing reinforcement learning (RL) in an object transportation task by two humanoid robots. They used Q-learning to learn the correct positions of the robots while the robots tightly cooperate to move the object. Ito (Ito and Gofuku, 2004) integrated a genetic algorithm (GA) with RL to learn action-selection strategies in a cooperative transportation task with a swarm of robots. Ferch and Zhang (2002) presented a cooperative manipulation project where two industrial robots learn cooperative grasping with RL. However, all these developments assumed a static and known environment in the multi-robot systems.

Arkin (Martinson et al., 2002; Martinson and Arkin, 2003) integrated the behavior-based approach with machine learning. They used the behavior-based representation to reduce the learning space in Q-learning, and the robots learned to select their roles with Q-learning. A multi-robot foraging task validated their method. Mataric (1997) also integrated machine learning with a behavior-based architecture for multi-robot transportation and foraging tasks. In addition, she studied the effects of communication among robots on Q-learning. In their latest paper, Jones and Mataric (2004) presented a formal framework for describing and reasoning about coordination in a multi-robot system, and developed a method for automatically synthesizing the controllers of robots. However, they assumed a simple environment with no obstacles in their research.

It seems popular to apply RL to multi-robot coordination. However, the above examples using RL in the multi-robot domain do not have a strong theoretical basis. The environment in a multi-robot task is usually dynamic, and it is affected simultaneously by multiple robots. Simply extending the single-agent RL algorithm to the multi-robot field violates its assumption of a stationary environment (Yang and Gu, 2004; Wang and de Silva, 2006a). However, it is believed that this kind of extension is feasible if the task is purely cooperative, such as multi-robot transportation (Yang and Gu, 2004).

There are common shortcomings in the existing work in the area of multi-robot object transportation, some of which are listed below:

1. Inadequate emphasis on the learning mechanism in the object transportation process.
2. No integration of learning and evolution. (Learning is not optimal and also may fail in a dynamic and unknown environment.)
3. Information (or, a model) on local dynamic/kinematic interactions between the robots and the object is not explicitly used.
4. The control of robot contact forces is not included in the overall control scheme.
5. Implementations on physical robots are few and not rigorous.

In the present paper, a physical multi-robot system integrated with machine learning is developed for carrying out object transportation in a dynamic and unknown environment. The multi-agent architecture, machine learning, and hybrid force/position control are incorporated, enabling the system to operate in a robust and effective manner. The performance of the developed system is evaluated through computer simulation and laboratory experimentation.
The main objective of the present paper, which is based on our previous work (Wang and de Silva, 2005, 2006a, b), is to develop a general learning-based multi-agent architecture for multi-robot coordination in a dynamic and unknown environment. The main contributions of the work are as follows. First, the learning-based multi-robot architecture developed here supports multi-robot systems that work in a dynamic, unknown environment. This addresses a major challenge faced by researchers in the multi-robot community, since previous work, like the behavior-based and planning-based approaches (Huntsberger et al., 2003; Parker, 2000; Miyata et al., 2002), assumed a static or known environment for multi-robot transportation tasks. Second, a modified Q-learning algorithm called the sequential Q-learning algorithm is developed, which is shown to be more suitable than the conventional single-agent Q-learning algorithm for decision making in multi-robot systems that encounter resource or behavior conflicts. Third, GAs are integrated into the developed Q-learning algorithm for coping with the local-minimum problem. Last, to the best of our knowledge, the system developed in this paper is the first machine-learning-based multi-robot physical transportation system that can work in a dynamic and unknown environment with complex obstacle distributions.

The primary objective of the work presented here is to develop a physical multi-robot system, where a group of intelligent robots work cooperatively to transport an object to a goal location and orientation in an unknown dynamic environment. Obstacles may be present and even appear randomly during the transportation process. Robot control, multi-agent approaches, and machine learning are integrated into the developed physical platform to cope with the main challenges of the problem. A schematic representation of the first version of the developed system is shown in Fig. 1.

In this first prototype there is one fixed robotic arm and one mobile robot, which cooperatively transport an object to a goal location. During the course of the transportation, they have to negotiate the cooperation strategy and decide on the optimal locations and amplitudes of the individual forces applied by them so that the object is transported quickly and effectively while avoiding any obstacles that may be present in the path. In some special cases, they also need to consider whether to remove the obstacles rather than negotiate around them. Other considerations, such as the level of energy utilization and avoiding damage to the transported object, may have to be taken into account.

A global camera is used to monitor and measure the current location and orientation of the object. The environment is dynamic and unpredictable, with some movable obstacles, which may appear randomly, and some fixed obstacles. The robots, the camera, and the sensors are separately linked to their host computers, which are connected through a local network, to implement complicated controls and machine intelligence.

A robotic arm and a mobile robot are used in the system shown in Fig. 1. This system provides an opportunity to observe the cooperation between two heterogeneous robots with different dynamics. Moreover, with the exact localization ability of the robotic arm and the broad ground coverage capability of the mobile robot, it is possible to integrate the different advantages of the two robots in a complementary manner, so as to improve the effectiveness of the overall system.

The multi-agent architecture as outlined is used here as the infrastructure to implement cooperative activities between the robots. This infrastructure is schematically shown in Fig. 2.

There are four software agents and two physical agents in this architecture, forming the overall multi-agent system. Each agent possesses its own internal state (intention and belief) and is equipped with its own sensing and decision-making capabilities. The agents are also able to communicate with each other and exchange information on their environment as acquired through sensors, and on the intentions and actions of their peers. Based on the information from their own sensors and their internal states, the agents cooperatively determine a cooperation strategy to transport the object.
Fig. 1. The developed system.

Fig. 2. The multi-agent architecture used in the system (a vision agent, a learning/evolution agent, and two robot assistant agents form the high-level coordination subsystem; physical agents #1 and #2 form the low-level control subsystem).
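The command flow of this two-layer architecture can be sketched in code. The following Python fragment is purely illustrative and is not the authors' implementation (the original system was implemented in Java over a local network); all class and field names here are invented for the sketch.

```python
# Illustrative sketch of the two-layer architecture of Fig. 2: the high-level
# learning/evolution agent broadcasts a cooperation strategy to the robot
# assistant agents, which forward commands to the low-level physical agents.

class PhysicalAgent:
    """Low-level robot: stores and executes the last command it received."""
    def __init__(self, name):
        self.name = name
        self.current_command = None

    def execute(self, command):
        self.current_command = command

class RobotAssistantAgent:
    """Bridge between one physical robot and the high-level layer."""
    def __init__(self, robot):
        self.robot = robot

    def forward(self, command):
        # Relay the desired force/trajectory down to the robot.
        self.robot.execute(command)

class LearningEvolutionAgent:
    """High-level decision maker: broadcasts the cooperation strategy."""
    def __init__(self, assistants):
        self.assistants = assistants

    def broadcast(self, strategy):
        # strategy: one command per robot (e.g., desired force and position).
        for assistant, command in zip(self.assistants, strategy):
            assistant.forward(command)

robots = [PhysicalAgent("arm"), PhysicalAgent("mobile")]
assistants = [RobotAssistantAgent(r) for r in robots]
coordinator = LearningEvolutionAgent(assistants)
coordinator.broadcast([{"force": 5.0}, {"force": 3.0}])
```

Because the upper layer only broadcasts commands and the lower layer keeps executing its last command, the two layers can run at different rates, which is the property exploited later to accommodate a slow GA.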
In Fig. 2, the four software agents constitute a high-level coordination subsystem. They cooperate and coordinate with each other to generate cooperation strategies, and assign subtasks to the two robot assistant agents. In the meantime, the two physical agents (robots) form a low-level control and execution subsystem, which receives commands from the high-level subsystem.

There is a vision agent, which is in charge of acquiring and processing the images using a CCD camera. Because the present paper focuses on developing a learning-based multi-robot architecture, a simplified computer vision scheme is employed, which uses a global camera and global localization technology. This vision agent analyzes the acquired images and computes the exact locations and orientations of the robots, the object, and the obstacles. All this information is then broadcast to the other three software agents so that they immediately know the current world state (the position and orientation messages of the robots, the object, and the obstacles, and the environment map information).

A special learning/evolution agent is used in the architecture to play the combined role of a learner, a monitor, and an advisor. The machine-learning algorithm and the reasoning/decision-making capabilities are embedded in this agent. First, it collects from the other agents the information on the environment, the robots, the object, and the obstacles to form the current world state, which can be shared by its peers. Then it makes decisions on the optimal cooperation strategy based on its own knowledge base, so that the common task can be completed effectively and quickly. Finally, it assigns the roles of the other agents and monitors the task schedule.

Each physical agent (robot) is linked to a corresponding assistant agent, which needs to forward the robot position/force information to the other agents in the upper level, and to receive the desired force/trajectory of the robot from the learning/evolution agent and send it to the corresponding robot in the low-level subsystem.

All six agents form a tightly coupled multi-agent system to transport the object cooperatively to the goal location while avoiding obstacles.

4. Cooperation based on machine learning

The learning ability is a desirable feature for a multi-robot transportation system. It helps the robots to cope with the dynamic changes in an unknown environment. Although conventional planning algorithms can make the robots move effectively in a static environment, they usually fail in an environment with dynamic obstacles. A multi-robot system with learning capability possesses characteristics of adaptation and self-organization, which enable it to overcome the difficulties in an unknown environment. In this paper, two well-known machine-learning approaches, RL and GAs, are integrated into the learning/evolution agent to achieve robust and effective cooperation between the robots.

4.1. Reinforcement learning

A key problem in a multi-robot transportation system is the determination of the cooperation strategy; i.e., finding the optimal amplitudes and locations of the applied forces of the robots so that a net force with maximum amplitude points to the goal location, while avoiding any obstacles. RL provides a way to determine an appropriate cooperation strategy through experimentation with the environment. In our previous work (Wang and de Silva, 2005, 2006b), a multi-robot box-pushing task based on Q-learning was developed and validated. However, two disadvantages were noticed when directly employing the conventional Q-learning algorithm in tightly coupled multi-robot transportation tasks. One disadvantage is that there may exist behavior conflicts among robots when they select actions concurrently. When two robots select the same action to push the box to the same location, they will compete with each other, thereby violating the cooperation strategy. The other disadvantage of directly using Q-learning in multi-robot systems is that it results in a poor cooperation strategy and slow learning, because one robot does not consider the actions of the other robots before it selects its own action. A possible solution to this problem is to employ the team Q-learning algorithm (Littman, 2001). However, such an algorithm results in an excessively large learning space in multi-robot tasks, making it impractical (Wang and de Silva, 2006a). In order to overcome these two disadvantages, a modified Q-learning algorithm called the sequential Q-learning algorithm is developed here, which is suitable for multi-robot systems with behavior conflicts. This algorithm may be summarized as follows:

Assume that there are n robots, R_1, R_2, ..., R_n, which are arranged in a special sequence, and their subscripts represent their positions in this sequence.
L_1, L_2, ..., L_n are the corresponding action sets available for the robots. In particular, the robot R_i has m_i actions available for execution: L_i = (a_{i1}, a_{i2}, ..., a_{im_i}).
Q_1(s, a), Q_2(s, a), ..., Q_n(s, a) are the corresponding Q tables for the robots. All entries in the Q tables are initialized to zero.
Initialize t to 0.99.
C is a set including all actions selected by the robots so far, and φ represents the empty set.
Do the following repeatedly:
- Initialize C = φ.
- Observe the current world state s.
- For (i = 1 to n):
  - Generate the currently available action set D_i = L_i - (L_i ∩ C).
  - The robot R_i selects the action a_j^i ∈ D_i with probability P(a_j^i) = e^{Q_i(s, a_j^i)} / Σ_{r=1}^{k} e^{Q_i(s, a_r^i)}, where a_r^i ∈ D_i (r = 1, 2, ..., k) and k is the size of the set D_i.
  - Add the action a_j^i to the set C.
after establishing that they are capable of carrying out this sweeping task.

The reward r is calculated as follows:

r = D_{t_i} - D_{t_{i-1}},   (2)

where D_{t_i} and D_{t_{i-1}} are the distances between the mass center of the object and the goal location at times t_i and t_{i-1}, respectively.

4.2. Genetic algorithms

Another approach that is used here to find the optimal cooperation strategy is GAs. By simulating the biological evolution process, a GA provides a powerful way to quickly search a large candidate space and determine the optimal individual.

A typical GA, as found in Karray and de Silva (2004) and Mitchell (1997), is given below:

Initialize p (the number of individuals in the population), η (the fraction of the population to be replaced by crossover at each step), m (the mutation rate), and fitness_threshold.
Generate p individuals at random to produce the first generation of the population P.
Calculate fitness(i) for each i (individual) in P.
While (max_i fitness(i)) < fitness_threshold, repeat:
- Produce a new generation P_s:
  - Probabilistically select (1 - η)p members of P to be included in P_s. The selection probability is given by

    λ(i) = fitness(i) / Σ_{j=1}^{p} fitness(j).   (3)

  - Execute the crossover operation to generate (ηp)/2 pairs of offspring and add them to P_s.
  - Mutate m percent of the members of P_s.
  - Update P ← P_s.
  - Calculate fitness(i) for each i in P.
Return the individual with the highest fitness in P.

In this paper, the entire candidate space is formed by F = (a_1, b_1, F_1, ..., a_i, b_i, F_i, ..., a_n, b_n, F_n), described in Fig. 4, where n is the number of robots; this is the set of all possible cooperation strategies in Fig. 4, together with the "sweeping" action. This candidate space can be represented by a vector space R^{3n+1}, which is made up of all possible actions (F, sweeping) as given in Section 4.1. The GA given above is used to search for the optimal individual vector, i.e., the optimal robot cooperation strategy, in this large vector space.

In the beginning, the "sweeping" action and nine individuals randomly selected from the candidate space are used to constitute the first generation of the population. Then the GA outlined before is used to search for the optimal individual with the maximum fitness value. The fitness is calculated as follows:

Fitness = C, if it is a "sweeping" action and some removable obstacles are in the path;
        = 0, if it is a "sweeping" action without removable obstacles in the path;
        = M, otherwise;   (4)

M = [k_1 |F| + k_2 (1 + cos θ)^2 + k_3 / |G| + k_4 S + k_5 / (1 + cos τ)^2] / (k_1 + k_2 + k_3 + k_4 + k_5),   (5)

where C is a threshold value which is determined through trials, F is the net force applied by the robots, θ is the angle between the net force vector and the goal location vector, as described in Fig. 3, G is the net torque applied by the robots, S is the area of the polygon formed by the robot locations, τ is the angle between the net force vector and the position vector of its closest obstacle, and k_1–k_5 are the weights (k_1 + k_2 + k_3 + k_4 + k_5 = 1).

The fitness function includes several practical considerations. First, the "sweeping" action is always intended for use in the current world state. If a removable obstacle is present in the forward path of the object (it is assumed that, based on its size, color, shape, and so on, the agents are able to judge whether the obstacle can be moved away by the robots), the "sweeping" action will assume a high fitness value so that the obstacle can be removed immediately. In order for it to always have the opportunity to be selected, the "sweeping" action is included in the population of the first generation and it is always copied to the next generation in the evolution process.

Second, a large net force that points to the goal location will earn a high fitness value, so that the internal force becomes minimum and the object is transported to the goal location quickly and effectively. Third, a high net torque G is not encouraged, because unnecessary rotation of the object will make it difficult for the robots to handle it, while a small G may be encouraged for avoiding the obstacles. Fourth, a large S implies that the spacing of the robots is quite adequate, which will receive a high fitness value, because crowding of the robots will make their operations more difficult and less effective. Finally, a small τ means that the probability of the presence of an obstacle in the path is high; this condition should be punished. The weights k_1–k_5 are calibrated through experimentation.

In each step of object transportation, the GA is used to search for the optimal cooperation strategy. Then this strategy is broadcast to the two robot assistant agents by the learning/evolution agent. Next, the robot assistant agents forward the commands to the physical robots. Based on their internal hybrid force/position controllers, the robots then move themselves to the desired positions and transport the object for a specified time period. In the meantime, the current force and position information is sent back to the high-level coordination subsystem via the robot assistant agents for decision making in the next step. After the object is moved through some distance, the world state will have changed (for example, new obstacles are observed, and so on). In this manner, the GA is used again to search for the new optimal cooperation strategy under the new world state. This process is repeated until the object is successfully moved to the goal location.

The main problem of a GA in a physical robotic system is that its real-time performance cannot be guaranteed, because of its low convergence speed. However, in the present paper, this problem is solved partly by using the layered architecture shown in Fig. 2. In this two-layer multi-agent architecture, the low-level physical agents can continue to execute the old commands before the new commands are given by the agents in the upper level. In other words, the agents in the upper level do not need to work at the speed of the physical agents in the lower level. This enables a "low-speed" GA agent to make decisions for a real-time task.

4.3. Integration of learning and evolution

RL may face difficulties and generate non-optimal outputs. A typical problem is that "the agent runs the risk of over-committing to actions that are found to have high Q values during the early stages of training, while failing to explore other actions that also have high values" (Mitchell, 1997). Another problem is that RL may not be able to capture slow changes in the environment. In other words, when the environment changes very slowly, RL will have less opportunity to try the actions under different world states. Consequently, the knowledge base of RL will be updated very slowly. This means that the learning process will become very slow and the output of RL will not be optimal.

GAs can partly overcome these problems. Because of the special evolution operations such as crossover and mutation in GAs, it is possible for a GA to select a new search point in the candidate space which has not been tried in the past steps (Nolfi and Floreano, 1999). This provides an opportunity for the actions with low Q-values in RL to demonstrate their capabilities. Moreover, because GAs simulate the biological evolution process, they provide better adaptivity than RL and are able to capture slow changes in the environment.

On the other hand, GAs also benefit from RL, because GAs cannot guarantee a precisely optimal output. In particular, in the system developed in the present work, the crowding phenomenon (Mitchell, 1997) was observed, where "very similar individuals take over a large fraction of the population". In addition, GAs are unable to deal with quick changes in the environment, because the evolution process usually requires a considerable duration of time. All these problems greatly slow down the evolution process and make the output of the GAs unacceptable.

However, by integrating GAs with RL, different individuals can be inserted into the population of the GA through RL, because RL uses an entirely different mechanism to select the new cooperation strategy. Consequently, it is possible to disturb the population purity in a GA and resolve the "crowding" problem in part. In addition, RL is able to deal with quick changes in the environment, where a GA usually fails.

The combination of learning and evolution, two important adaptive mechanisms that exist in nature, gives rise to distinct advantages, as discussed before. In this paper, the specific integration scheme indicated in Fig. 5 is implemented in the multi-agent object transportation system.

In Fig. 5, the sensor data are first sent to a sensory fusion module. This information forms the current positions and orientations of the object, the robots, and the obstacles, and the state of the environment. Moreover, there is a model knowledge base in which the geometric and dynamic parameters of the object and the robots are stored. These model parameters are determined off-line and stored in the model knowledge base prior to operating the system. When the sensory data are sent to a local modeling module, the module identifies and updates the current model parameters by combining the off-line information in the knowledge base and the new sensory data. The information from the sensory fusion module and the model knowledge base is then integrated to form the new world state, as described in Sections 4.1 and 4.2.

The two decision-making mechanisms, RL and GA, are implemented based on the algorithms presented in the previous sections. Both are able to determine a robot cooperation strategy according to the current world state. Their outputs are probabilistically selected by an arbitrator, which determines the winning output. In this paper, it is assumed that the output of the RL block is selected by the arbitrator with a probability of 0.7, while the output of the GA block is selected with a probability of 0.3.

Fig. 5. The integration scheme of RL and GA (sensors feed a sensory fusion module and a robot/object local modeling module backed by a model knowledge base; the extracted world state drives the genetic algorithm and reinforcement learning blocks, whose outputs a probabilistic arbitrator selects between to produce the action at time t, with action rewarding and performance evaluation feeding back to the RL knowledge base).
The output of the arbitrator, i.e., the cooperation strategy, of the corresponding robot. As the supervisor in the upper
is then broadcasted to the two robot assistant agents by level for the low-level control subsystem, the agent then
the learning/evolution agent to implement the object sends commands to the low-level position controller to
transportation. initiate the first control step, which is to move the robots to
the desired position and orientation. Next, the agent starts
5. Robot hybrid force/position control the second control step, which requires the position
controller to continue the position control strategy,
Force/position control of robots is indispensable in a while asking the force controller to execute the desired
physical multi-robot transportation system. Without prop- force control strategy in a specific direction in order for the
er control of the forces and positions of the robots, the robot to exert a constant pushing or pulling force on the
cooperation strategy described in Section 4 may not be object.
effective. The hybrid force/position control scheme shown
in Fig. 6 is implemented in the low-level control subsystem 6. Simulations
in the multi-robot system of the present work.
In Fig. 6, two sets of control modes, position control and force control, are integrated. Here L and L′ are the direction constraint matrices (Craig, 2005), which, respectively, determine in which directions the force control mode is used and in which directions the position control mode is employed. They are diagonal matrices with ones and zeros on the diagonal. For a diagonal element of "1" in L, there is a "0" as the corresponding diagonal element of L′, indicating that position control is in effect. On the other hand, if a "0" is present on the diagonal of L, then "1" will be present as the corresponding element of L′, indicating that force control is in effect. Note that L and L′ are set by the robot assistant agent in the upper level. Moreover, the robot assistant agent provides the desired forces and positions for the low-level control, based on the current cooperation strategy broadcast by the learning/evolution agent.
In addition, an adaptive control mechanism (Astrom and Wittenmark, 1994) is integrated into the control scheme. In each control loop, the robot dynamic model is identified on-line and this information is used to tune the controller parameters.
Finally, a two-step control strategy is used in the practical control process. When the robot assistant agent receives a new cooperation strategy from the learning/evolution agent, it computes the desired force and position …

6.1. Basic results

In order to verify and validate the effectiveness of the integrated learning/evolution decision-making approach outlined before, a simulation system based on Java multi-thread programming is developed. This system simulates the multi-robot object transportation process and assesses the feasibility of the learning/evolution-based multi-agent architecture.
In Fig. 7, a rectangular environment with dimensions of 950 × 700 pixels is assumed. Several fixed obstacles are distributed randomly in the environment. In addition, in order to alleviate the programming burden, the object is simplified as a rectangular box with dimensions of 120 × 80 pixels. There are two robots in the environment, denoted in the subsequent figures by circles with a radius of 5 pixels, which push the box to the goal location while avoiding the obstacles. It is assumed that the box is sufficiently big and heavy that no individual robot can handle it without the help of its peer. Java multi-thread programming is used to implement the parallel operation of the robots.
Some parameters of the simulation system are listed below: ε = 0.9, μ = 0.8, p = 10, η = 0.2, M = 0.5, C = 10, k1 = 0.4, k2 = 0.3, k3 = 0.05, k4 = 0.1, k5 = 0.15; the detection radius is 72, and the total number of rounds in the training stage is 1000. After training, two rounds of box-pushing are presented in Fig. 8.
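The complementary structure of L and L′ described above lends itself to a small numerical sketch. The following Java fragment is a toy illustration only: the two-axis setup, the values, and the simple additive combination u = L·u_pos + L′·u_force are assumptions made for clarity, not the paper's actual control law.

```java
public class HybridSelection {
    // u[i] = L[i]*uPos[i] + L'[i]*uForce[i]; L and L' have complementary
    // 1/0 diagonals, so each axis obeys exactly one control mode.
    public static double[] combine(int[] L, int[] Lp, double[] uPos, double[] uForce) {
        double[] u = new double[L.length];
        for (int i = 0; i < L.length; i++) {
            if (L[i] + Lp[i] != 1)  // diagonals must be complementary
                throw new IllegalArgumentException("L and L' must be complementary");
            u[i] = L[i] * uPos[i] + Lp[i] * uForce[i];
        }
        return u;
    }

    public static void main(String[] args) {
        int[] L  = {1, 0};             // axis 0: position control in effect
        int[] Lp = {0, 1};             // axis 1: force control in effect
        double[] uPos   = {0.5, 9.9};  // axis-1 entry is masked out by L
        double[] uForce = {9.9, -0.2}; // axis-0 entry is masked out by L'
        double[] u = combine(L, Lp, uPos, uForce);
        System.out.println(u[0] + " " + u[1]); // 0.5 -0.2
    }
}
```

Because the diagonals are complementary, exactly one controller's output survives on each axis, which is the routing role the constraint matrices play in Fig. 6.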
Fig. 6. The robot hybrid force/position control scheme.

Fig. 7. The simulation system of the multi-robot transportation.
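The parallel operation of the simulated robots described above rests on Java multi-threading. The following is a minimal sketch of that idea, not the authors' simulator: the shared box coordinate, the step counts, and the push increments are invented for illustration.

```java
// Toy sketch of two robot threads cooperatively "pushing" a shared box,
// in the spirit of the Java multi-thread simulation described above.
// Each thread adds its push displacement under a lock so updates are
// never lost; all quantities are illustrative.
public class BoxPushSim {
    private double boxX = 0.0;               // shared box coordinate
    public synchronized void push(double dx) { boxX += dx; }
    public synchronized double position() { return boxX; }

    public double run(int stepsPerRobot, double dx) throws InterruptedException {
        Runnable robot = () -> {
            for (int i = 0; i < stepsPerRobot; i++) push(dx);
        };
        Thread r1 = new Thread(robot), r2 = new Thread(robot);
        r1.start(); r2.start();              // both robots push in parallel
        r1.join(); r2.join();                // wait for the round to finish
        return position();
    }

    public static void main(String[] args) throws InterruptedException {
        double x = new BoxPushSim().run(1000, 0.5);
        System.out.println(x);               // 1000.0 (2 robots x 1000 steps x 0.5)
    }
}
```

The `synchronized` methods stand in for whatever coordination the real simulator uses; without them, concurrent `+=` updates could be lost.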
learned knowledge. Every time a round of box-pushing was completed, the saved final Q-tables were reloaded into the robots to start the next round of the task. When 500 rounds of box-pushing were completed, the number of steps per round was analyzed statistically. The results are presented in Fig. 11. It is seen that the sequential Q-learning algorithm has spent a shorter time to complete the task than what was taken by conventional single-agent Q-learning. In particular, the former has spent a mean value of 500 steps to push the box from the starting point to the goal location, while the latter has spent a mean value of 1050 steps to complete the same task in the same environment.
It may be concluded from Figs. 10 and 11 that the sequential Q-learning algorithm developed in this paper demonstrates better performance than the conventional single-agent Q-learning algorithm in the multi-robot cooperative box-pushing task described here. The developed algorithm is particularly suitable for decision making in a multi-robot domain with resource or behavior conflicts.
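The conflict-avoidance idea behind sequential action selection can be sketched as follows. This is a simplified reconstruction: the Q-values, the notion of a "claimed" action, and the purely greedy rule are illustrative stand-ins for the paper's sequential Q-learning formulation, not the algorithm itself.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SequentialSelect {
    // Greedy action choice that skips actions already claimed by earlier
    // robots in the selection sequence (a stand-in for behavior-conflict
    // resolution in sequential action selection; values are illustrative).
    public static int greedyAvoidingConflicts(double[] q, Set<Integer> taken) {
        int best = -1;
        for (int a = 0; a < q.length; a++) {
            if (taken.contains(a)) continue;   // action claimed: would conflict
            if (best < 0 || q[a] > q[best]) best = a;
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] q = { {0.90, 0.80, 0.10},   // robot 0's Q-values for 3 actions
                         {0.99, 0.40, 0.20} }; // robot 1 also prefers action 0
        Set<Integer> taken = new LinkedHashSet<>();
        for (double[] qi : q)                  // robots choose one at a time
            taken.add(greedyAvoidingConflicts(qi, taken));
        System.out.println(taken);             // [0, 1]: robot 1 yields to robot 0
    }
}
```

With simultaneous (single-agent) selection, both robots would grab action 0 and conflict; choosing sequentially lets the second robot fall back to its next-best non-conflicting action.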
7. Testbed and system architecture
Fig. 14. The robots move to the desired position and push the box under hybrid force/position control. (a) The initial state. (b) Moving to the desired position. (c) Pushing of the box.
The mobile robot is a two-wheel-driven intelligent machine with two 500-tick encoders, eight sonar units, wireless LAN communication capability, and a force sensor in its end effector. The mobile robot is linked to the host computer over the wireless LAN via a CISCO™ wireless Ethernet access point. It can wander within a radius of 50 m without any communication interruption.
A color CCD camera with a resolution of 768 × 492 pixels is used to monitor the positions and orientations of the box, the obstacles, and the mobile robot. The vision images are captured into the computer by a frame-grabber board at a sampling rate of 30 MHz. There are color blobs on the top surfaces of the box, the mobile robot, and the obstacles so that their orientations and locations are easily monitored by the vision agent outlined in Fig. 2. In addition, the colors and shapes of the blobs are used to identify whether an obstacle is fixed or removable.
A C++ program is developed to integrate data acquisition, sensory fusion, data analysis, image grabbing and processing, the multi-agent architecture, learning/evolution-based decision making, and hybrid force/position digital control of the robots. Moreover, multi-thread programming based on the Microsoft Windows 2000 platform and the RTX real-time OS is used to implement parallel running of the multiple agents. The software/hardware architecture is presented in Fig. 13.
The experimental system architecture shown in Fig. 13 includes three levels of components: the hardware layer, the low-level control layer, and the high-level coordination layer. The force sensors, the optical encoders and motors of the
Fig. 15. Position control of the robotic arm. (a) The trajectory of the end effector. (b) The x coordinate of the end effector. (c) The y coordinate of the end effector. (d) The speed of the end effector.
robots, the sonar units, and the CCD camera are in the hardware layer. They form the basic hardware of the experimental system.
In the low-level control layer, the data acquisition board, the frame-grabber board, and the mobile robot CPU are used to acquire information, analyze it, and control the components in the hardware layer. There are two software controllers in this layer to implement real-time control of the two robots so that they can move to the desired positions while exerting the desired contact forces. In addition, a computer vision program is developed to track the latest positions and orientations of the robots, the object, and the obstacles.
In the high-level coordination layer, the multi-agent architecture and the learning/evolution-based algorithm are implemented to search for the optimal cooperation strategy so that the object is transported effectively while avoiding the obstacles in the environment.
From Fig. 17 it is seen that when the robots detect a removable obstacle in the forward path (this information is given by the vision agent), they select the "sweeping" action. In the "sweeping" process, the two robots are assigned different roles: the mobile robot leaves the box, moves towards the obstacle, and pushes it away from the path. While the mobile robot is sweeping the obstacle, the robotic arm stays in place, waiting for its peer to return. Once the vision agent reports that the obstacle has been swept away, the mobile robot returns and continues the task of box transportation in cooperation with the robotic arm.
This experiment demonstrates that the multi-agent robotic system developed in the present work, equipped with a machine-learning capability, exhibits good flexibility and adaptivity in complex environments, while effectively carrying out an intended task.
Fig. 17. The robots selected the "sweeping" action to remove an obstacle in the path. (a) Initial state (pushing of the box). (b) Leaving the box. (c) Removing the obstacle. (d) Returning.
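The role switching in the "sweeping" behavior can be summarized as a small state machine for the mobile robot. The states and triggering events below are reconstructed from the description above for illustration; they are not taken from the authors' implementation.

```java
// Toy state machine for the mobile robot's "sweeping" behavior: it leaves
// the box, removes the obstacle, then returns to resume cooperative
// pushing. State names and event strings are illustrative.
public class SweepStateMachine {
    public enum State { PUSHING, MOVING_TO_OBSTACLE, SWEEPING, RETURNING }

    public static State next(State s, String event) {
        switch (s) {
            case PUSHING:            return event.equals("removable-obstacle-detected")
                                            ? State.MOVING_TO_OBSTACLE : s;
            case MOVING_TO_OBSTACLE: return event.equals("obstacle-reached")
                                            ? State.SWEEPING : s;
            case SWEEPING:           return event.equals("obstacle-cleared")  // reported by the vision agent
                                            ? State.RETURNING : s;
            case RETURNING:          return event.equals("box-reached")
                                            ? State.PUSHING : s;
        }
        return s;  // unrecognized event: stay in the current state
    }

    public static void main(String[] args) {
        State s = State.PUSHING;
        for (String e : new String[]{"removable-obstacle-detected",
                "obstacle-reached", "obstacle-cleared", "box-reached"}) {
            s = next(s, e);
        }
        System.out.println(s); // PUSHING: the full sweep cycle completes
    }
}
```

The robotic arm's counterpart would be trivial in this sketch (wait while its peer is away), which matches the asymmetric role assignment described in the experiment.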
9. Conclusions

This paper presents a multi-robot object transportation system developed by the authors in the IAL. A new multi-agent architecture was designed for this purpose, which integrated RL and GAs together with a robotic hybrid force/position control scheme. A modified Q-learning algorithm, called the sequential Q-learning algorithm, was developed in this paper. The empirical results showed that the algorithm is particularly suitable for decision making in multi-robot systems with resource or behavior conflicts. The details of these developments and their implementation in the object transportation system were presented. Computer simulations in the Java language and experimentation with the developed prototype were used to demonstrate the feasibility and effectiveness of the system.

Acknowledgments

Funding for the work reported in this paper has come from the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Canada Foundation for Innovation (CFI), and the K.C. Wong Education Foundation of Hong Kong.

References

Arai, T., Pagello, E., Parker, L.E. (Eds.), 2002. Guest editorial: advances in multirobot systems. IEEE Transactions on Robotics and Automation 18 (5), 655–661.
Astrom, K.J., Wittenmark, B., 1994. Adaptive Control, second ed. Addison-Wesley, Menlo Park, CA.
Cao, Y.U., Fukunaga, A.S., Kahng, A.B., 1997. Cooperative mobile robotics: antecedents and directions. Autonomous Robots 4 (1), 7–27.
Craig, J.J., 2005. Introduction to Robotics: Mechanics and Control. Pearson Prentice-Hall, Upper Saddle River, NJ.
Ferch, M., Zhang, J., 2002. Learning cooperative grasping with the graph representation of a state–action space. Robotics and Autonomous Systems 38, 183–195.
Huntsberger, T., Pirjanian, P., et al., 2003. CAMPOUT: a control architecture for tightly coupled coordination of multirobot systems for planetary surface exploration. IEEE Transactions on Systems, Man and Cybernetics—Part A 33 (5), 550–559.
Inoue, Y., Tohge, T., Iba, H., 2004. Object transportation by two humanoid robots using cooperative learning. In: Proceedings of the Congress on Evolutionary Computation (CEC 2004), Portland, OR, pp. 1201–1208.
Ito, K., Gofuku, A., 2004. Hybrid autonomous control for multi mobile robots. Advanced Robotics 18 (1), 83–99.
Jones, C., Mataric, M.J., 2004. Automatic synthesis of communication-based coordinated multi-robot systems. In: Proceedings of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, pp. 381–387.
Karray, F.O., de Silva, C.W., 2004. Soft Computing and Intelligent Systems Design. Pearson Addison Wesley, New York.
Kumar, M., Garg, D.P., 2004. Sensor-based estimation and control of forces and moments in multiple cooperative robots. Journal of Dynamic Systems, Measurement, and Control, Transactions of the ASME 126, 276–283.
Littman, M.L., 2001. Value-function reinforcement learning in Markov games. Journal of Cognitive Systems Research 2 (1), 55–66.
Liu, J., Wu, J., 2001. Multi-Agent Robotic Systems. CRC Press, Boca Raton, FL.
Martinson, E., Arkin, R.C., 2003. Learning to role-switch in multi-robot systems. In: Proceedings of the 2003 IEEE International Conference on Robotics and Automation, Taipei, Taiwan, pp. 2727–2734.
Martinson, E., Stoytchev, A., Arkin, R.C., 2002. Robot behavioral selection using Q-learning. In: Proceedings of the 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems, EPFL, Lausanne, Switzerland, pp. 970–977.
Mataric, M.J., 1997. Reinforcement learning in the multi-robot domain. Autonomous Robots 4 (1), 73–83.
Mitchell, T.M., 1997. Machine Learning. McGraw-Hill, New York.
Miyata, N., Ota, J., Arai, T., Asama, H., 2002. Cooperative transport by multiple mobile robots in unknown static environments associated with real-time task assignment. IEEE Transactions on Robotics and Automation 18 (10), 769–780.
Nolfi, S., Floreano, D., 1999. Learning and evolution. Autonomous Robots 7 (1), 89–113.
Parker, L.E., 1998. ALLIANCE: an architecture for fault tolerant multirobot cooperation. IEEE Transactions on Robotics and Automation 14 (2), 220–240.
Parker, L.E., 2000. Lifelong adaptation in heterogeneous multi-robot teams: response to continual variation in individual robot performance. Autonomous Robots 8, 239–267.
Pereira, G., Campos, M., Kumar, V., 2004. Decentralized algorithms for multi-robot manipulation via caging. The International Journal of Robotics Research 23 (7–8), 783–795.
Rus, D., Donald, B., Jennings, J., 1995. Moving furniture with teams of autonomous robots. In: Proceedings of the IEEE/RSJ International Conference on Human Robot Interaction and Cooperative Robots, Pittsburgh, PA, pp. 235–242.
Schenker, P.L., Huntsberger, T., et al., 2003. Planetary rover developments supporting Mars exploration, sample return and future human–robotic colonization. Autonomous Robots 14, 103–126.
Stone, P., Veloso, M., 1998. Layered approach to learning client behaviors in the RoboCup soccer server. Applied Artificial Intelligence 12 (2–3), 165–188.
Stone, P., Veloso, M., 2000. Multiagent systems: a survey from a machine learning perspective. Autonomous Robots 8 (3), 345–383.
Stroupe, A., Huntsberger, T., et al., 2005. Behavior-based multi-robot collaboration for autonomous construction tasks. In: Proceedings of the 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, Edmonton, Alta., Canada, pp. 1989–1994.
Sugar, T., Desai, J.P., Kumar, V., et al., 2001. Coordination of multiple mobile manipulators. In: Proceedings of the 2001 IEEE International Conference on Robotics and Automation, Seoul, Korea, pp. 3022–3027.
Sugar, T.G., Kumar, V., 2002. Control of cooperating mobile manipulators. IEEE Transactions on Robotics and Automation 18 (1), 94.
Wang, Y., de Silva, C.W., 2005. An object transportation system with multiple robots and machine learning. In: Proceedings of the 2005 American Control Conference (ACC 2005), Portland, OR, pp. 1371–1376.
Wang, Y., de Silva, C.W., 2006a. Multi-robot box-pushing: single-agent Q-learning vs. team Q-learning. In: Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Beijing, China, pp. 3694–3699.
Wang, Y., de Silva, C.W., 2006b. Extend single-agent reinforcement learning approach to a multi-robot cooperative task in an unknown dynamic environment. In: Proceedings of the IEEE 2006 International Joint Conference on Neural Networks (IJCNN), Vancouver, Canada, pp. 10098–10104.
Wang, Z., Takano, Y., Hirata, Y., et al., 2004. A pushing leader based decentralized control method for cooperative object transportation. In: Proceedings of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, pp. 1035–1040.
Wang, Z., Hirata, Y., Kosuge, K., 2005. An algorithm for testing object caging condition by multiple mobile robots. In: Proceedings of the 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, Edmonton, Alta., Canada, pp. 2664–2669.
Yamada, S., Saito, J., 2001. Adaptive action selection without explicit communication for multi-robot box-pushing. IEEE Transactions on Systems, Man and Cybernetics 31 (3), 398–404.
Yamashita, A., Arai, T., Ota, J., et al., 2003. Motion planning of multiple mobile robots for cooperative manipulation and transportation. IEEE Transactions on Robotics and Automation 19 (2), 223–237.
Yang, E., Gu, D., 2004. Multiagent reinforcement learning for multi-robot systems: a survey. Technical Report. <http://robotics.usc.edu/maja/teaching/cs584/papers/yang04multiagent.pdf>.