
Engineering Applications of Artificial Intelligence 21 (2008) 470–484

www.elsevier.com/locate/engappai

A machine-learning approach to multi-robot coordination


Ying Wang, Clarence W. de Silva
Industrial Automation Laboratory, Department of Mechanical Engineering, The University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
Received 7 May 2006; received in revised form 22 February 2007; accepted 15 May 2007
Available online 12 July 2007

Abstract

This paper presents a machine-learning approach to the multi-robot coordination problem in an unknown dynamic environment.
A multi-robot object transportation task is employed as the platform to assess and validate this approach. Specifically, a flexible two-
layer multi-agent architecture is developed to implement multi-robot coordination. In this architecture, four software agents form a high-
level coordination subsystem while two heterogeneous robots constitute the low-level control subsystem. Two types of machine
learning—reinforcement learning (RL) and genetic algorithms (GAs)—are integrated to make decisions when the robots cooperatively
transport an object to a goal location while avoiding obstacles. A probabilistic arbitrator is used to determine the winning output
between the RL and GA algorithms. In particular, a modified RL algorithm called the sequential Q-learning algorithm is developed to
deal with the issues of behavior conflict that arise in multi-robot cooperative transportation tasks. The learning-based high-level
coordination subsystem sends commands to the low-level control subsystem, which is implemented with a hybrid force/position control
scheme. Simulation and experimental results are presented to demonstrate the effectiveness and adaptivity of the developed approach.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Multi-robot systems; Cooperative control; Q-learning; Genetic algorithms; Intelligent transportation; Multi-agent systems; Autonomous
robots

Corresponding author. Tel.: +1 604 822 4850; fax: +1 604 822 2403. E-mail address: yingwang70@hotmail.com (Y. Wang).

1. Introduction

A multi-robot system is a group of robots, which are organized into a multi-agent architecture so as to cooperatively carry out a common task. In the past 10 years, multi-robot systems have received more and more attention in the robotics community because such systems possess some special capabilities like cooperative behavior, robustness, parallel operation, and scalability (Arai et al., 2002; Cao et al., 1997; Liu and Wu, 2001).

An important research topic in multi-agent robotics is the multi-robot object transportation problem. In this problem, several autonomous robots move cooperatively to transport an object to a goal location and orientation in a static or dynamic environment, which usually accommodates fixed or removable obstacles. The object is usually heavy or sufficiently large so that no single robot can handle it alone. It is a rather challenging task.

Much work has been accomplished for multi-robot transportation by the robotics community. Parker (1998, 2000) proposed a distributed and behavior-based multi-robot architecture: L-ALLIANCE. L-ALLIANCE uses the two concepts of "impatient" and "acquiesce" to motivate behaviors dynamically. Moreover, by assessing the performance of the teammates and dynamic changes in the environment, L-ALLIANCE autonomously updates its parameters to adapt to those changes. L-ALLIANCE was validated in a cooperative box-pushing task by two heterogeneous mobile robots.

Miyata et al. (2002) studied cooperative transportation by multiple mobile robots in unknown static environments. A centralized task planning and assignment architecture is proposed in their work. Priority-based assignment algorithms and motion planning approaches are employed. In their later paper (Yamashita et al., 2003), a motion planning approach in a 3D environment was presented.


Similar research in multi-robot transportation can be found in Rus's work (Rus et al., 1995), where they employed a team of autonomous robots to move furniture, and in Yamada's work (Yamada and Saito, 2001), where they developed a multi-robot box-pushing system based on a reactive behavior architecture.

Huntsberger et al. (2003) proposed a behavior-based multi-robot architecture, CAMPOUT, for the multi-robot work crew (RWC) project developed at JPL of NASA (Schenker et al., 2003). CAMPOUT is a fully distributed, behavior-based multi-robot architecture. It employs the leader–follower decentralized control strategy. CAMPOUT was validated in a multi-robot cooperative transportation task in a PV tent autonomous deployment project on a planetary surface and in a cliff robot project. In their latest work, Stroupe et al. (2005) presented a multi-robot autonomous construction task based on the CAMPOUT architecture.

Wang (Wang et al., 2004, 2005) and Kumar (Pereira et al., 2004; Sugar et al., 2001; Sugar and Kumar, 2002) studied the caging formation control problem in multi-robot cooperative transportation. They proposed the "object closure" strategy to move the object with multiple mobile robots while maintaining the formation.

A research field related to multi-robot object transportation is multi-robot manipulation, where multiple industrial robotic manipulators cooperatively carry an object while minimizing the internal forces that are generated among the robots. There has been much work done in this field. For example, Kumar and Garg (2004) designed a fuzzy logic controller to control internal forces and moments while two industrial manipulators cooperatively carry an object.

Another interesting topic in the field of multi-agent robotics is multi-robot learning (Stone and Veloso, 2000; Yang and Gu, 2004). The learning capability is desirable for a multi-robot system. It will help the robots to cope with a dynamic or unknown environment, find the optimal cooperation strategy, and make the entire system increasingly flexible and autonomous. Although most of the existing commercial multi-robot systems are controlled remotely by a human, autonomous performance will desirably be the objective of the next-generation systems. Without the learning capability, it will be virtually impossible for a robotic system to become a fully autonomous system. Therefore, the introduction of machine-learning technologies into a multi-robot system has become a necessity.

Stone and Veloso (1998) suggested a layered approach with machine learning for multi-agent cooperation. In their paper, neural networks were used to learn the low-level skills of a robot and the decision tree method was employed to learn the high-level strategies. However, their work focused on the specific application of robotic soccer. Inoue (Inoue et al., 2004) suggested employing reinforcement learning (RL) in an object transportation task by two humanoid robots. They used Q-learning to learn the correct positions of the robots while the robots tightly cooperate to move the object. Ito (Ito and Gofuku, 2004) integrated a genetic algorithm (GA) with RL to learn action selection strategies in a cooperative transportation task with a swarm of robots. Ferch and Zhang (2002) presented a cooperative manipulation project where two industrial robots learn cooperative grasping with RL. However, all these developments assumed a static and known environment in the multi-robot systems.

Arkin (Martinson et al., 2002; Martinson and Arkin, 2003) integrated the behavior-based approach with machine learning. They used the behavior-based representation to reduce the learning space in Q-learning, and the robots learned to select their roles with Q-learning. A multi-robot foraging task validated their method. Mataric (1997) also integrated machine learning with a behavior-based architecture for multi-robot transportation and foraging tasks. In addition, she studied the effects of communication among robots on Q-learning. In their latest paper, Jones and Mataric (2004) presented a formal framework for describing and reasoning about coordination in a multi-robot system, and developed a method for automatically synthesizing the controllers of robots. However, they assumed a simple environment with no obstacles in their research.

It seems popular to apply RL to multi-robot coordination. However, the above examples of using RL in the multi-robot domain do not have a strong theoretical basis. The environment in a multi-robot task is usually dynamic and is affected simultaneously by multiple robots. Simply extending the single-agent RL algorithm to the multi-robot field violates its assumption of a stationary environment (Yang and Gu, 2004; Wang and de Silva, 2006a). However, it is believed that this kind of extension is feasible if the task is purely cooperative, such as multi-robot transportation (Yang and Gu, 2004).

There are common shortcomings in the existing work in the area of multi-robot object transportation, some of which are listed below:

1. Inadequate emphasis on the learning mechanism in the object transportation process.
2. No integration of learning and evolution. (Learning is not optimal and also may fail in a dynamic and unknown environment.)
3. Information (or, a model) on local dynamic/kinematic interactions between the robots and the object is not explicitly used.
4. The control of robot contact forces is not included in the overall control scheme.
5. Implementations on physical robots are few and not rigorous.

In the present paper, a physical multi-robot system integrated with machine learning is developed, for carrying out object transportation in a dynamic and unknown environment. The multi-agent architecture, machine learning, and hybrid force/position control are incorporated, enabling the system to operate in a robust and effective manner. The performance of the developed system is evaluated through computer simulation and laboratory experimentation.

The main objective of the present paper, which is based on our previous work (Wang and de Silva, 2005, 2006a, b), is to develop a general learning-based multi-agent architecture for multi-robot coordination in a dynamic and unknown environment. The main contributions of the work are as follows. First, the learning-based multi-robot architecture, which is developed here, supports multi-robot systems that work in a dynamic unknown environment. This addresses a major challenge that is faced by researchers in the multi-robot community, since previous work, like the behavior-based and planning-based approaches (Huntsberger et al., 2003; Parker, 2000; Miyata et al., 2002), assumed a static or known environment for multi-robot transportation tasks. Second, a modified Q-learning algorithm called the sequential Q-learning algorithm is developed, which is proved to be more suitable than the conventional single-agent Q-learning algorithm for decision making associated with multi-robot systems that encounter resource or behavior conflicts. Third, GAs are integrated into the developed Q-learning algorithm for coping with the local minimum problem. Last, to the best of our knowledge, the system developed in this paper is the first machine-learning-based multi-robot physical transportation system that can work in a dynamic and unknown environment with complex obstacle distributions.

2. Problem formulation

The primary objective of the work presented here is to develop a physical multi-robot system, where a group of intelligent robots work cooperatively to transport an object to a goal location and orientation in an unknown dynamic environment. Obstacles may be present and even appear randomly during the transportation process. Robot control, multi-agent approaches, and machine learning are integrated into the developed physical platform to cope with the main challenges of the problem. A schematic representation of the first version of the developed system is shown in Fig. 1.

In this first prototype there is one fixed robotic arm and one mobile robot, which cooperatively transport an object to a goal location. During the course of the transportation, they have to negotiate the cooperation strategy and decide on the optimal locations and amplitudes of the individual forces applied by them so that the object is transported quickly and effectively while avoiding any obstacles that may be present in the path. In some special cases, they also need to consider whether to remove the obstacles rather than negotiating around them. Other considerations such as the level of energy utilization and avoiding damage to the transported object may have to be taken into account.

A global camera is used to monitor and measure the current location and orientation of the object. The environment is dynamic and unpredictable with some movable obstacles, which may appear randomly, and some fixed obstacles. The robots, the camera, and the sensors are separately linked to their host computers, which are connected through a local network, to implement complicated controls and machine intelligence.

A robotic arm and a mobile robot are used in the system shown in Fig. 1. This system provides an opportunity to observe the cooperation between two heterogeneous robots with different dynamics. Moreover, with the exact localization ability of the robotic arm and the broad ground coverage capability of the mobile robot, it is possible to integrate the different advantages of the two robots in a complementary manner, so as to improve the effectiveness of the overall system.

3. Multi-agent infrastructure

The multi-agent architecture as outlined is used here as the infrastructure to implement cooperative activities between the robots. This infrastructure is schematically shown in Fig. 2.

There are four software agents and two physical agents in this architecture, forming the overall multi-agent system. Each agent possesses its own internal state (intention and belief) and is equipped with its own sensing and decision-making capabilities. They are also able to communicate with each other and exchange information on their environment as acquired through sensors, and the intentions and actions of the peers. Based on the information from their own sensors and their internal states, the agents cooperatively determine a cooperation strategy to transport the object.

Fig. 1. The developed system.

Fig. 2. The multi-agent architecture used in the system.

In Fig. 2, the four software agents constitute a high-level coordination subsystem. They will cooperate and coordinate with each other to generate cooperation strategies, and assign subtasks to the two robot assistant agents. In the meantime, the two physical agents (robots) form a low-level control and execution subsystem, which receives commands from the high-level subsystem.

There is a vision agent, which is in charge of acquiring and processing the images using a CCD camera. Because the present paper focuses on developing a learning-based multi-robot architecture, a simplified computer vision scheme is employed, which uses a global camera and global localization technology. This vision agent will analyze the acquired images and compute the exact locations and orientations of the robots, the object, and the obstacles. All the information is then broadcasted to the other three software agents so that they will immediately know the current world state (the position and orientation messages of the robots, the object and the obstacles, and the environment map information).

A special learning/evolution agent is used in the architecture to play the combined role of a learner, a monitor, and an advisor. The machine-learning algorithm and the reasoning/decision-making capabilities are embedded in this agent. First, it will collect from other agents the information of the environment, the robots, the object, and the obstacles to form the current world state, which can be shared by its peers. Then it will make decisions on the optimal cooperation strategy based on its own knowledge base so that the common task can be completed effectively and quickly. Finally, it will assign the roles of other agents, and monitor the task schedule.

Each physical agent (robot) is linked to a corresponding assistant agent, which needs to forward the robot position/force information to the other agents in the upper level, and to receive the desired force/trajectory of the robot from the learning/evolution agent and send them to the corresponding robot in the low-level subsystem.

All six agents will form a tightly coupled multi-agent system to transport the object cooperatively to the goal location while avoiding obstacles.
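To make the division of responsibilities concrete, the agent roles described above can be sketched as a set of Java interfaces. This is an illustrative sketch only, not the authors' implementation; all type and method names (WorldState, CooperationCommand, and so on) are assumptions introduced here.

    // Hypothetical sketch of the two-layer multi-agent organization described above.
    interface Agent { String name(); }

    // High-level coordination subsystem (software agents).
    interface VisionAgent extends Agent {
        WorldState observe();                      // image acquisition and global localization
    }
    interface LearningEvolutionAgent extends Agent {
        CooperationCommand decide(WorldState s);   // RL/GA decision making on the world state
    }
    interface RobotAssistantAgent extends Agent {
        void forward(CooperationCommand c);        // pass the desired force/trajectory down
        RobotFeedback collect();                   // pass force/position feedback back up
    }

    // Low-level control and execution subsystem (physical agents).
    interface PhysicalRobotAgent extends Agent {
        void execute(CooperationCommand c);        // hybrid force/position control of the robot
    }

    // Simple value holders used by the interfaces above (contents are illustrative).
    record WorldState(double theta, int obstacleCode) {}
    record CooperationCommand(double alpha, double beta, double force, boolean sweeping) {}
    record RobotFeedback(double[] position, double[] force) {}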
4. Cooperation based on machine learning

The learning ability is a desirable feature for a multi-robot transportation system. It will help the robots to cope with the dynamic changes in an unknown environment. Although conventional planning algorithms can make the robots move effectively in a static environment, they usually fail in an environment with dynamic obstacles. A multi-robot system with learning capability possesses characteristics of adaptation and self-organization, which will enable it to overcome the difficulties in an unknown environment. In this paper, two well-known machine-learning approaches, RL and GAs, are integrated into the learning/evolution agent to achieve robust and effective cooperation between the robots.

4.1. Reinforcement learning

A key problem in a multi-robot transportation system is the determination of the cooperation strategy; i.e., finding the optimal amplitudes and locations of the applied forces of the robots so that a net force with maximum amplitude points to the goal location, while avoiding any obstacles. RL provides a way to determine an appropriate cooperation strategy through experimentation with the environment. In our previous work (Wang and de Silva, 2005, 2006b), a multi-robot box-pushing task based on Q-learning was developed and validated. However, two disadvantages were noticed when directly employing the conventional Q-learning algorithm in tightly coupled multi-robot transportation tasks. One disadvantage is that there may exist behavior conflicts among robots when they select actions concurrently. When two robots select the same action to push the box to the same location, they will compete with each other, thereby violating the cooperation strategy. The other disadvantage of directly using Q-learning in multi-robot systems is that it results in a poor cooperation strategy and slow learning, because one robot does not consider the actions of the other robots before it selects its own action. A possible solution to this problem is to employ a team Q-learning algorithm (Littman, 2001). However, such an algorithm will result in an excessively large learning space in multi-robot tasks, making it not practical (Wang and de Silva, 2006a). In order to overcome these two disadvantages, a modified Q-learning algorithm called the sequential Q-learning algorithm is developed here, which is suitable for multi-robot systems with behavior conflicts. This algorithm may be summarized as follows:

• Assume that there are n robots, R_1, R_2, ..., R_n, which are arranged in a special sequence; their subscripts represent their positions in this sequence.
• L_1, L_2, ..., L_n are the corresponding action sets available for the robots. In particular, the robot R_i has m_i actions available for execution: L_i = (a^i_1, a^i_2, ..., a^i_{m_i}).
• Q_1(s, a), Q_2(s, a), ..., Q_n(s, a) are the corresponding Q tables for the robots. All entries in the Q tables are initialized to zero.
• Initialize τ to 0.99.
• C is a set including all actions selected by the robots so far, and φ represents the empty set.
• Do the following repeatedly:
  o Initialize C = φ
  o Observe the current world state s
  o For (i = 1 to n)
    – Generate the currently available action set D_i = L_i − (L_i ∩ C)
    – The robot R_i selects the action a^i_j ∈ D_i with probability P(a^i_j) = e^{Q_i(s, a^i_j)} / Σ_{r=1}^{k} e^{Q_i(s, a^i_r)}, where a^i_r ∈ D_i (r = 1, 2, ..., k) and k is the size of the set D_i
    – Add the action a^i_j to the set C

  o End For
  o For each robot R_i (i = 1, 2, ..., n), execute the corresponding selected action a^i_j
  o Receive an immediate reward r
  o Observe the new state s'
  o For each robot R_i (i = 1, 2, ..., n), update its table entry for Q_i(s, a^1_j, a^2_j, ..., a^i_j) as follows:

      Q_i(s, a^1_j, a^2_j, ..., a^i_j) = (1 − ε) Q_i(s, a^1_j, a^2_j, ..., a^i_j) + ε [ r + μ max_{a^1, a^2, ..., a^i} Q_i(s', a^1, a^2, ..., a^i) ],    (1)

    where 0 < ε < 1 is the learning rate and 0 < μ < 1 is the discount rate.
  o s ← s', τ ← τ × 0.999.
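For illustration, one decision cycle of the algorithm summarized above can be sketched in Java (the language used for the simulations of Section 6). This is a minimal sketch under simplifying assumptions rather than the authors' code: each robot's Q table is indexed here only by the state and that robot's own action (Eq. (1) also conditions on the preceding robots' actions), the decayed exploration parameter τ is omitted, and all identifiers are hypothetical.

    import java.util.*;

    // Minimal, illustrative sketch of sequential Q-learning for n robots.
    class SequentialQLearning {
        static final double EPSILON = 0.9;   // learning rate (the simulations use 0.9)
        static final double MU = 0.8;        // discount rate (the simulations use 0.8)
        static final Random RNG = new Random();

        // q[i][state][action]: simplified per-robot Q table.
        final double[][][] q;

        SequentialQLearning(int robots, int states, int actions) {
            q = new double[robots][states][actions];
        }

        // Robots choose actions one by one; actions already in C are excluded (D_i = L_i \ C).
        int[] selectActions(int s) {
            int n = q.length;
            int[] chosen = new int[n];
            Set<Integer> c = new HashSet<>();
            for (int i = 0; i < n; i++) {
                List<Integer> available = new ArrayList<>();
                for (int a = 0; a < q[i][s].length; a++) if (!c.contains(a)) available.add(a);
                chosen[i] = softmaxPick(q[i][s], available);
                c.add(chosen[i]);
            }
            return chosen;
        }

        // P(a) proportional to exp(Q_i(s, a)) over the available set D_i.
        int softmaxPick(double[] qRow, List<Integer> available) {
            double sum = 0.0;
            for (int a : available) sum += Math.exp(qRow[a]);
            double u = RNG.nextDouble() * sum, acc = 0.0;
            for (int a : available) {
                acc += Math.exp(qRow[a]);
                if (u <= acc) return a;
            }
            return available.get(available.size() - 1);
        }

        // After all robots act, the shared reward r and the new state s' update every table (Eq. (1), simplified).
        void update(int s, int[] actions, double r, int sNext) {
            for (int i = 0; i < q.length; i++) {
                double best = Arrays.stream(q[i][sNext]).max().orElse(0.0);
                q[i][s][actions[i]] = (1 - EPSILON) * q[i][s][actions[i]] + EPSILON * (r + MU * best);
            }
        }
    }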
The basic idea of the sequential Q-learning algorithm comes from a strategy that is typically employed by humans when they cooperatively push a large and heavy object to a goal location, where the group members will not select their pushing locations and forces concurrently. Instead, one will select his pushing location and apply a force first. Then, by observing the first person's action, the second person will select his pushing action (i.e., cooperation strategy) accordingly. Next, the third person will determine his action by observing the actions of the first two people. In this manner, when all the people in the group have determined their actions, they will execute the overall pushing actions through simple synchronization. This cooperative strategy is known to work well in manual tasks. The same strategy is used here to develop the sequential Q-learning algorithm for multi-robot transportation tasks.

In the sequential Q-learning algorithm, in each step of decision making, the robots do not select their actions at the same time. Instead, they select their actions one by one according to a predefined sequence. Before a robot selects its action, it observes which actions have been selected by the robots that precede it in the sequence. By not selecting the same actions as those of the preceding robots, this robot successfully solves the behavior conflict problem and promotes effective cooperation with its teammates.

The benefits of the sequential Q-learning algorithm are obvious. Because each robot observes the actions of the robots preceding it in the sequence before it makes its own decision, the sequential Q-learning algorithm is likely to possess more effective cooperation and faster convergence than a single-agent Q-learning algorithm in multi-robot cooperative tasks (Wang and de Silva, 2006b). On the other hand, because each robot does not need to observe the actions of all its teammates, the sequential Q-learning algorithm results in a significantly smaller learning space (or Q tables) than that for the team Q-learning algorithm (Wang and de Silva, 2006a). In addition, by selecting their actions according to a particular predefined sequence, two robots avoid selecting the same pushing action, which would result in behavior conflicts and cooperation failure in multi-robot cooperative transportation tasks.

In the system developed in this paper, the world state is defined as the relative position and orientation between the object and the goal location, and the positions of the obstacles within a fixed detection radius around the object. It is shown in Fig. 3.

In Fig. 3, only obstacles located within the detection radius around the object are detected and considered. The detection circle is evenly divided into 16 sectors and the sectors are represented by a 16-bit binary code. If an obstacle is located in one of the sectors, the corresponding bit of the binary code will become "1"; otherwise it will become "0" accordingly. This binary code represents the obstacle distribution around the object. The world state s in the Q-learning algorithm is made up of this binary code and the angle θ in Fig. 3, which describes the relative orientation between the goal location and the object. For example, the world state shown in Fig. 3 can be represented by the vector s = [θ 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0].

The cooperation strategies, i.e., the force locations and amplitudes of all robots, which are introduced in Fig. 4, form the set of actions in the Q-learning algorithm. In Fig. 4, the application point of the force from each robot is represented by a pair of angles (α, β). Therefore, the possible actions in the Q-learning algorithm for the robot R_i can be modeled as the set L_i = {(α, β, F)_k}, k = 1, 2, ..., m_i.

Fig. 3. The world state representation.

Fig. 4. Representation of the cooperation strategy (action).
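A small sketch of how such a world state could be assembled is given below; the sector indexing convention and all identifiers are assumptions made here for illustration, not taken from the paper.

    // Illustrative construction of the world state of Fig. 3: a 16-bit obstacle code plus the goal angle.
    class WorldStateEncoder {
        static final int SECTORS = 16;               // detection circle divided into 16 sectors
        static final double DETECTION_RADIUS = 72;   // pixels, the value used in the simulations

        // Bit k of the code is set when an obstacle lies in sector k within the detection radius.
        // Obstacle coordinates are given relative to the object's mass center.
        static int obstacleCode(double[][] obstaclesXY) {
            int code = 0;
            for (double[] p : obstaclesXY) {
                if (Math.hypot(p[0], p[1]) > DETECTION_RADIUS) continue;   // outside the circle
                double ang = Math.atan2(p[1], p[0]);
                if (ang < 0) ang += 2 * Math.PI;                           // map to [0, 2*pi)
                int sector = (int) (ang / (2 * Math.PI / SECTORS));        // 0 .. 15
                code |= (1 << sector);
            }
            return code;
        }

        // The world state s = [theta, b15 ... b0]: goal direction plus the obstacle code bits.
        static double[] state(double theta, double[][] obstaclesXY) {
            double[] s = new double[1 + SECTORS];
            s[0] = theta;
            int code = obstacleCode(obstaclesXY);
            for (int k = 0; k < SECTORS; k++) s[1 + k] = (code >> k) & 1;
            return s;
        }
    }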

There remains a special action called "sweeping", which instructs the robots to remove all the obstacles in the path after establishing that they are capable of carrying out this sweeping task.

The reward r is calculated as follows:

    r = D_{t_i} − D_{t_{i−1}},    (2)

where D_{t_i} and D_{t_{i−1}} are the distances between the mass center of the object and the goal location at times t_i and t_{i−1}, respectively.

4.2. Genetic algorithms

Another approach that is used here to find the optimal cooperation strategy is GAs. By simulating the biological evolution process, a GA provides a powerful way to quickly search and determine the optimal individual in a large candidate space. A typical GA, as found in Karray and de Silva (2004) and Mitchell (1997), is given below:
• Initialize p (the number of individuals in the population), η (the fraction of the population to be replaced by crossover at each step), m (the mutation rate), and fitness_threshold.
• Generate p individuals at random to produce the first generation of the population P.
• Calculate fitness(i) for each individual i in P.
• While (max_i fitness(i)) < fitness_threshold, repeat:
  o Produce a new generation P_s:
    – Probabilistically select (1 − η)p members of P to be included in P_s. The selection probability is given by

        λ(i) = fitness(i) / Σ_{j=1}^{p} fitness(j)    (3)

    – Execute the crossover operation to generate (ηp)/2 pairs of offspring and add them to P_s.
    – Mutate m percent of the members of P_s.
    – Update P ← P_s.
    – Calculate fitness(i) for each i in P.
• Return the individual with the highest fitness in P.
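A compact Java sketch of this GA cycle is given below. It is illustrative only: the chromosome is treated as a plain real-valued vector, single-point crossover and Gaussian mutation are assumed (the paper does not specify the operators), and all identifiers are hypothetical.

    import java.util.*;
    import java.util.function.ToDoubleFunction;

    // Minimal, illustrative sketch of the GA cycle listed above.
    class SimpleGa {
        static final Random RNG = new Random();

        static double[] evolve(List<double[]> population, ToDoubleFunction<double[]> fitness,
                               double eta, double mutationRate, double fitnessThreshold, int maxGenerations) {
            for (int g = 0; g < maxGenerations; g++) {
                population.sort(Comparator.comparingDouble(fitness).reversed());
                if (fitness.applyAsDouble(population.get(0)) >= fitnessThreshold) break;

                int p = population.size();
                double total = population.stream().mapToDouble(fitness).sum();
                List<double[]> next = new ArrayList<>();
                // Fitness-proportional selection of (1 - eta) * p members, Eq. (3).
                for (int k = 0; k < (int) Math.round((1 - eta) * p); k++)
                    next.add(roulette(population, fitness, total).clone());
                // Crossover fills the remaining slots (roughly (eta * p) / 2 pairs of offspring).
                while (next.size() < p) {
                    double[] a = roulette(population, fitness, total);
                    double[] b = roulette(population, fitness, total);
                    int cut = 1 + RNG.nextInt(a.length - 1);
                    double[] c1 = a.clone(), c2 = b.clone();
                    for (int j = cut; j < a.length; j++) { c1[j] = b[j]; c2[j] = a[j]; }
                    next.add(c1);
                    if (next.size() < p) next.add(c2);
                }
                // Mutation: perturb a fraction of the genes.
                for (double[] ind : next)
                    for (int j = 0; j < ind.length; j++)
                        if (RNG.nextDouble() < mutationRate) ind[j] += RNG.nextGaussian() * 0.1;
                population.clear();
                population.addAll(next);
            }
            population.sort(Comparator.comparingDouble(fitness).reversed());
            return population.get(0);
        }

        // Roulette-wheel selection with probability fitness(i) / sum_j fitness(j).
        static double[] roulette(List<double[]> pop, ToDoubleFunction<double[]> fitness, double total) {
            double u = RNG.nextDouble() * total, acc = 0.0;
            for (double[] ind : pop) {
                acc += fitness.applyAsDouble(ind);
                if (u <= acc) return ind;
            }
            return pop.get(pop.size() - 1);
        }
    }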
Fourth, a large S implies that the spacing of the robots is
In this paper, the entire candidate space is formed quite adequate, which will receive a high fitness value,
by F ¼ ða1 ; b1 ; F 1 ; . . . ; ai ; bi ; F i ; . . . an ; bn ; F n Þ described in because crowding of the robots will make their operations
Fig. 4, where n is the number of robots, which is the set of more difficult and less effective. Finally, a small t means
all possible cooperation strategies in Fig. 4, and the that the probability of the presence of an obstacle in the
‘‘sweeping’’ action. This candidate space can be represented path is high, this condition should be punished. The
by a vector space R3n+1, which is made up of all possible weights k1k5 are calibrated through experimentation.
actions: ðF; sweepingÞ as given in Section 4.1. The GA given In each step of object transportation, the GA is used to
above is used to search for the optimal individual vector; search for the optimal cooperation strategy. Then this
i.e., the optimal robot cooperation strategy, in this large strategy is broadcasted to the two robot assistant agents by
vector space. the learning/evolution agent. Next, the robot assistant
In the beginning, the ‘‘sweeping’’ action and nine agents forward the commands to the physical robots.
individuals randomly selected from the candidate space Based on their internal hybrid force/position controllers,
are used to constitute the first generation of population. the robots then move themselves to the desired position
Then the GA outlined before is used to search for and transport the object for a specified time period. In the
the optimal individual with the maximum fitness value. meantime, the current force and position information is
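The fitness evaluation of Eqs. (4) and (5) can be sketched as follows. The geometric quantities (net force, net torque, polygon area, and the two angles) are assumed to be computed elsewhere, and all identifiers are illustrative.

    // Illustrative evaluation of the fitness in Eqs. (4)-(5); weights as in the simulations of Section 6.
    class CooperationFitness {
        static final double K1 = 0.4, K2 = 0.3, K3 = 0.05, K4 = 0.1, K5 = 0.15;
        static final double C = 10.0;   // fitness of a useful "sweeping" action (threshold set by trials)

        // sweeping: candidate is the special "sweeping" action; removableInPath: a removable obstacle lies ahead.
        // netForce = |F|, theta = angle(F, goal), netTorque = |G|, polygonArea = S, tau = angle(F, closest obstacle).
        static double fitness(boolean sweeping, boolean removableInPath,
                              double netForce, double theta, double netTorque,
                              double polygonArea, double tau) {
            if (sweeping) return removableInPath ? C : 0.0;            // Eq. (4)
            double torque = Math.max(netTorque, 1e-6);                 // small floor added here (not in the paper)
            double m = K1 * netForce
                     + K2 * Math.pow(1 + Math.cos(theta), 2)
                     + K3 / torque                                     // a small net torque is favored
                     + K4 * polygonArea                                // well spread-out robots are favored
                     + K5 / Math.pow(1 + Math.cos(tau), 2);            // an obstacle directly ahead lowers the score
            return m / (K1 + K2 + K3 + K4 + K5);                       // Eq. (5); the weights sum to 1
        }
    }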

In each step of object transportation, the GA is used to search for the optimal cooperation strategy. Then this strategy is broadcasted to the two robot assistant agents by the learning/evolution agent. Next, the robot assistant agents forward the commands to the physical robots. Based on their internal hybrid force/position controllers, the robots then move themselves to the desired positions and transport the object for a specified time period. In the meantime, the current force and position information is sent back to the high-level coordination subsystem via the robot assistant agents for decision making in the next step. After the object is moved through some distance, the world state would be changed (for example, new obstacles are observed, and so on). In this manner, the GA is used again to search for the new optimal cooperation strategy under the new world state. This process is repeated until the object is successfully moved to the goal location.

The main problem of a GA in a physical robotic system is that its real-time performance cannot be guaranteed because of its low convergence speed. However, in the present paper, this problem is solved partly by using the layered architecture shown in Fig. 2. In this two-layer multi-agent architecture, the low-level physical agents can continue to execute the old commands before new commands are given by the agents in the upper level. In other words, the agents in the upper level do not need to work at the speed of the physical agents in the lower level. This enables a "low-speed" GA agent to make decisions for a real-time task.
4.3. Integration of learning and evolution

RL may face difficulties and generate non-optimal outputs. A typical problem is that "the agent runs the risk of over-committing to actions that are found to have high Q values during the early stages of training, while failing to explore other actions that also have high values" (Mitchell, 1997). Another problem is that RL may not be able to capture slow changes in the environment. In other words, when the environment changes very slowly, RL will have less opportunity to try the actions under different world states. Consequently, the knowledge base of RL will be updated very slowly. This means that the learning process will become very slow and the output of RL will not be optimal.

GAs can partly overcome these problems. Because of the special evolution operations such as crossover and mutation in GAs, it is possible for GAs to select a new search point in the candidate space, which had not been tried in the past steps (Nolfi and Floreano, 1999). This will provide an opportunity for the actions with low Q-values in RL to demonstrate their capabilities. Moreover, because GAs simulate the biological evolution process, they provide better adaptivity than RL and are able to capture slow changes in the environment.

On the other hand, GAs also benefit from RL, because GAs cannot guarantee a precisely optimal output. In particular, in the system developed in the present work, the crowding phenomenon (Mitchell, 1997) was observed, where "very similar individuals take over a large fraction of the population". In addition, GAs are unable to deal with quick changes in the environment because the evolution process usually requires a considerable duration of time. All these problems greatly slow down the evolution process and make the output of the GAs unacceptable.

However, by integrating GAs with RL, different individuals can be inserted into the population of the GA through RL, because RL uses an entirely different mechanism to select the new cooperation strategy. Consequently, it is possible to disturb the population purity in a GA and resolve the "crowding" problem in part. In addition, RL is able to deal with quick changes in the environment, where a GA usually fails.

The combination of learning with evolution, which are two important adaptive mechanisms that exist in nature, gives rise to distinct advantages, as discussed before. In this paper, the specific integration scheme indicated in Fig. 5 is implemented in the multi-agent object transportation system.

In Fig. 5, the sensor data are first sent to a sensory fusion module. This information will form the current positions and orientations of the object, the robots and the obstacles, and the state of the environment. Moreover, there is a model knowledge base in which the geometric and dynamic parameters of the object and the robots are stored. These model parameters are determined off-line and stored in the model knowledge base prior to operating the system. When the sensory data are sent to a local modeling module, the module will identify and update the current model parameters by combining the off-line information in the knowledge base and the new sensory data. The information from the sensory fusion module and the model knowledge base is then integrated to form the new world state, as described in Sections 4.1 and 4.2.

The two decision-making mechanisms, RL and GA, are implemented based on the algorithms presented in the previous sections. Both are able to determine a robot cooperation strategy according to the current world state. Their outputs are probabilistically selected by an arbitrator, which determines the winning output. In this paper, it is assumed that the output of the RL block is selected by the arbitrator with a probability of 0.7, while the output of the GA block is selected with a probability of 0.3.

Fig. 5. The integration scheme of RL and GA.
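The arbitration step itself can be written in a few lines; the sketch below simply draws a random number against the stated 0.7/0.3 split, and the action type is a placeholder.

    import java.util.Random;

    // Sketch of the probabilistic arbitrator of Fig. 5: the RL output wins with probability 0.7, the GA output with 0.3.
    class Arbitrator {
        static final double RL_PROBABILITY = 0.7;
        static final Random RNG = new Random();

        static double[] arbitrate(double[] rlAction, double[] gaAction) {
            return (RNG.nextDouble() < RL_PROBABILITY) ? rlAction : gaAction;
        }
    }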

The output of the arbitrator, i.e., the cooperation strategy, is then broadcasted to the two robot assistant agents by the learning/evolution agent to implement the object transportation.

5. Robot hybrid force/position control

Force/position control of robots is indispensable in a physical multi-robot transportation system. Without proper control of the forces and positions of the robots, the cooperation strategy described in Section 4 may not be effective. The hybrid force/position control scheme shown in Fig. 6 is implemented in the low-level control subsystem of the multi-robot system of the present work.

In Fig. 6, two sets of control modes, position control and force control, are integrated. Here L and L' are the direction constraint matrices (Craig, 2005), which, respectively, determine in which direction the force control mode is used and in which directions the position control mode is employed. They are diagonal matrices with ones and zeros on the diagonal. For a diagonal element of "1" in L, there is a "0" as the corresponding diagonal element of L', indicating that position control is in effect. On the other hand, if a "0" is present on the diagonal of L, then "1" will be present as the corresponding element of L', indicating that force control is in effect. Note that L and L' are set by the robot assistant agent in the upper level. Moreover, the robot assistant agent will provide the desired forces and positions for the low-level control, based on the current cooperation strategy broadcasted by the learning/evolution agent.
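The role of the selection matrices can be illustrated with the usual hybrid-control combination, in which each direction of the commanded input is taken either from the position controller or from the force controller. The paper does not spell out this combination law, so the sketch below is an assumption based on standard hybrid force/position control, with illustrative names.

    // Illustrative combination of the two control modes through the diagonal selection matrices L and L'.
    class HybridForcePositionSelector {
        // l, lPrime: diagonals of L and L' (complementary ones and zeros);
        // uPos, uForce: outputs of the position and force controllers, one entry per direction.
        static double[] combine(double[] l, double[] lPrime, double[] uPos, double[] uForce) {
            double[] u = new double[l.length];
            for (int k = 0; k < l.length; k++) {
                u[k] = l[k] * uPos[k] + lPrime[k] * uForce[k];   // L selects position control, L' selects force control
            }
            return u;
        }
    }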
In addition, an adaptive control mechanism (Astrom and Wittenmark, 1994) is integrated into the control scheme. In each control loop, the robot dynamic model is identified on-line and this information is used to tune the controller parameters.

Finally, a two-step control strategy is used in the practical control process. When the robot assistant agent receives a new cooperation strategy from the learning/evolution agent, it computes the desired force and position of the corresponding robot. As the supervisor in the upper level for the low-level control subsystem, the agent then sends commands to the low-level position controller to initiate the first control step, which is to move the robots to the desired position and orientation. Next, the agent starts the second control step, which requires the position controller to continue the position control strategy, while asking the force controller to execute the desired force control strategy in a specific direction, in order for the robot to exert a constant pushing or pulling force on the object.

6. Simulations

6.1. Basic results

In order to verify and validate the effectiveness of the integrated learning/evolution decision-making approach outlined before, a simulation system based on Java multi-thread programming is developed. This system simulates the multi-robot object transportation processes and assesses the feasibility of the learning/evolution-based multi-agent architecture.

In Fig. 7, a rectangular environment with dimensions of 950 × 700 pixels is assumed. Several fixed obstacles are distributed randomly in the environment. In addition, in order to alleviate the programming burden, the object is simplified as a rectangular box with dimensions of 120 × 80 pixels. There are two robots, denoted in the subsequent figures by circles with a radius of 5, which will push the box to the goal location while avoiding the obstacles. It is assumed that the box is sufficiently big and heavy so that no individual robot can handle it without the help of the peer robot. The Java multi-thread programming technology is used to implement the parallel operations of the robots.

Some parameters of the simulation system are listed below: ε = 0.9, μ = 0.8, p = 10, η = 0.2, m = 0.5, C = 10, k_1 = 0.4, k_2 = 0.3, k_3 = 0.05, k_4 = 0.1, k_5 = 0.15; the detection radius is 72, and the total number of rounds in the training stage is 1000. After training, two rounds of box-pushing are presented in Fig. 8.
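The parallel operation of the two simulated robots mentioned above relies on ordinary Java threads; a minimal, purely illustrative sketch is shown below (the class and its behavior are hypothetical).

    // Two robot behaviors running concurrently, in the spirit of the multi-threaded simulation.
    class ParallelRobotsDemo {
        public static void main(String[] args) throws InterruptedException {
            Thread robot1 = new Thread(() -> push("Robot #1"));
            Thread robot2 = new Thread(() -> push("Robot #2"));
            robot1.start();          // both robots run in parallel,
            robot2.start();          // each executing its own pushing behavior
            robot1.join();
            robot2.join();
        }

        static void push(String name) {
            // Placeholder for one pushing step: sense, decide (RL/GA), apply the selected force.
            System.out.println(name + " executes its selected pushing action");
        }
    }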
Fig. 6. The robot hybrid force/position control scheme.

Fig. 7. The simulation system of the multi-robot transportation.

Fig. 8. The simulation results of multi-robot object transportation. (a) A round of box-pushing. (b) Another round of box-pushing under a different obstacle distribution.

From Fig. 8 it is observed that the multi-robot system equipped with the learning/evolution based multi-agent architecture is able to successfully transport the object to the goal location in a complicated environment with a random obstacle distribution. In Fig. 8(a), the robots first move the object upwards to approach the goal location. However, they soon detect two obstacles in the forward path. Although each robot only possesses local sensing capabilities (i.e., they can only detect the obstacles within the detection radius), after several trials the robots successfully find a safe path, which is located in the passage between the two obstacles. After avoiding the two obstacles, the robots quickly adjust their cooperation strategy to push the object directly toward the goal location. Before the object reaches the goal, another obstacle in the forward path is detected by the robots. At this time, the robots easily avoid the obstacle and finally push the object to the goal location. Fig. 8(b) shows another round of multi-robot object transportation with a more complicated obstacle distribution in the environment. Again it is noted that the robots are able to successfully transport the object to the goal location while avoiding the obstacles.

The performance of the proposed machine-learning approach is assessed using an experiment of object transportation. Here, in the environment shown in Fig. 8(b), the robots with blank Q tables (i.e., all entries in the Q tables are initialized to zero) are employed to push the object (box) repeatedly from the starting point to the goal location for 500 rounds. Every time a round of box-pushing is completed, the total number of steps in that round is recorded, all entries in the Q tables are initialized to zero again, and a new round of box-pushing is started. After the 500 rounds of box-pushing are completed, the probability density of the number of steps per round is estimated, which is shown in Fig. 9. Next, the robots are trained for 1000 rounds of continuous box-pushing to gain knowledge, which is stored in their Q tables. After the training stage, the resulting Q tables (i.e., the learned knowledge bases) are saved in disk files. Then, the robots with the learned Q tables are employed to push the box repeatedly from the starting point to the goal location for another 500 rounds. Every time a round of box-pushing is completed, the total number of steps in that round is recorded and the Q tables stored in the disk files are reloaded to start the next round of box-pushing. As before, the probability density of the number of steps per round in the 500 rounds is estimated, which is shown in Fig. 9.

Fig. 9. Probability density estimation of the number of steps per round before training and after training.

From Fig. 9 it is clear that the proposed learning approach has effectively improved the performance of the multi-robot team in the training stage. Before the training stage, the robots spent a long time exploring to push the box from the starting point to the goal location. The mean value of the number of steps per round is about 1500 and its standard deviation is quite large. However, after 1000 rounds of training, the learned knowledge helped the robots to complete the task quickly. In particular, the robots spend a mean value of only 500 steps per round, with a much smaller standard deviation, to push the box from the starting point to the goal location in the same environment.

Therefore, it is seen that the robots have learned and improved their cooperation strategy in the training stage.

From Figs. 8 and 9 one may conclude that the learning/evolution based multi-agent architecture which is developed in the present work possesses good flexibility and adaptivity in an unknown environment with a complex obstacle distribution. This learning/evolution-based multi-robot coordination approach has two advantages over the conventional planning-based approaches. First, it does not need a predefined or hardwired rule base, which is the basis of the conventional planning-based approaches. The approach developed in this paper learns the rules from the history of interaction with the environment and teammates. Therefore, it can provide more flexible selections of actions in a dynamic environment, which is usually quite difficult for the conventional planning-based approaches to handle. Second, this approach is simple compared with the conventional planning approaches. Because the rules of decision making are learned from the interaction history and improved in the course of the transportation processes, the design of the rule base is simplified. The designer does not need to predict all possible world states and design corresponding behavior rules for the robots in the design stage.

6.2. Sequential Q-learning versus single-agent Q-learning

The performance of the sequential Q-learning algorithm developed in this paper is compared with the conventional single-agent Q-learning which was employed for multi-robot decision making in our previous work (Wang and de Silva, 2005). In order to make this comparison meaningful, both Q-learning algorithms are integrated with the same GA algorithm described in Sections 4.2 and 4.3 of this paper. In addition, both algorithms are used to establish optimal cooperative strategies for the same multi-robot box-pushing system operating in the same environment shown in Fig. 8(b). The comparison results are given in Figs. 10 and 11.

Fig. 10. Comparison of average reward per round in 1000 rounds of continuous box-pushing using the sequential Q-learning and the conventional single-agent Q-learning.

Fig. 11. Comparison of the number of steps per round in 500 rounds of box-pushing using the sequential Q-learning and the conventional single-agent Q-learning.

Fig. 10 shows the average reward per round received by the robots when they pushed the box continuously for 1000 rounds. It is clear that the robots learned a good cooperation strategy and received increasing rewards in the first 20 rounds. Furthermore, it is noted in Fig. 10 that the sequential Q-learning algorithm shows a much better performance (larger average rewards with a smaller standard deviation) than the conventional single-agent Q-learning algorithm, when used in the box-pushing task.

Fig. 11 also shows that the developed sequential Q-learning algorithm has done a better job than the conventional single-agent Q-learning. Here, the number of steps per round is compared for the two algorithms. First, both algorithms were trained for 1000 rounds to gain the knowledge of the cooperation strategy, and the final Q tables after training (i.e., the knowledge bases) were saved in the disk files.

Next, the robots were employed to complete another 500 rounds of box-pushing with the learned knowledge. Every time a round of box-pushing was completed, the saved final Q tables were reloaded into the robots to start the next round of the task. When the 500 rounds of box-pushing were completed, the number of steps per round was analyzed statistically. The results are presented in Fig. 11. It is seen that the sequential Q-learning algorithm has spent a shorter time to complete the task than what was taken by the conventional single-agent Q-learning. In particular, the former has spent a mean value of 500 steps to push the box from the starting point to the goal location, while the latter has spent a mean value of 1050 steps to complete the same task in the same environment.

It may be concluded from Figs. 10 and 11 that the sequential Q-learning algorithm developed in this paper demonstrates better performance than the conventional single-agent Q-learning algorithm in the multi-robot cooperative box-pushing task described here. The developed algorithm is particularly suitable for decision making in a multi-robot domain with resource or behavior conflicts.

7. Testbed and system architecture

The testbed developed for the present implementation in the Industrial Automation Laboratory (IAL) is shown in Fig. 12. It includes a four-joint 2D robotic arm, an Amigo™ mobile robot, a JVC™ color CCD camera, two Transducer Techniques™ force sensors, a rectangular box (the object), an indexing table, a ServoToGo™ motion control board, a Matrox™ frame grabber board, a CISCO™ wireless LAN access point, and a Pentium IV™ computer with 512 MB RAM. The robotic arm is particularly designed to move an object on a 2D surface. Its four joints (two arm joints and two wrist joints) are equipped with optical encoders to exactly detect the joint motions. One force sensor is mounted in its end effector. The encoder and force data are acquired by the host computer via the motion control board.

Fig. 12. The experimental system in IAL at UBC.

Fig. 13. The software/hardware architecture of the experimental system.

Fig. 14. The robots move to the desired position and push the box under hybrid force/position control. (a) The initial state. (b) Moving to the desired position. (c) Pushing of the box.

The mobile robot is a two-wheel-driven intelligent machine with two 500-tick encoders, eight sonar units, wireless LAN communication ability, and a force sensor in its end effector. The mobile robot is linked to the host computer through the wireless LAN via a CISCO™ wireless Ethernet access point. It can wander within a radius of 50 m without any communication interruption.

A color CCD camera with a resolution of 768 × 492 pixels is used to monitor the positions and orientations of the box, the obstacles, and the mobile robot. The vision images are captured into the computer by a frame grabber board at a sampling rate of 30 MHz. There are color blobs on the top surfaces of the box, the mobile robot, and the obstacles so that their orientations and locations are easily monitored by the vision agent outlined in Fig. 2. In addition, the colors and shapes of the blobs are used to identify whether an obstacle is fixed or removable.

A C++ program is developed to integrate data acquisition, sensory fusion, data analysis, image grabbing and processing, the multi-agent architecture, learning/evolution-based decision making, and hybrid force/position digital control of the robots. Moreover, multi-thread programming technology based on the Microsoft Windows 2000 platform and the RTX real-time OS is used to implement parallel running of the multiple agents. The software/hardware architecture is presented in Fig. 13.

The experimental system architecture shown in Fig. 13 includes three levels of components: the hardware layer, the low-level control layer, and the high-level coordination layer.

The force sensors, the optical encoders and motors of the robots, the sonar units, and the CCD camera are in the hardware layer. They form the basic hardware of the experimental system.

In the low-level control layer, the data acquisition board, the frame grabber board, and the mobile robot CPU are used to acquire information, analyze it, and control the components in the hardware layer. There are two software controllers in this layer to implement real-time control of the two robots so that they can move to the desired positions while exerting the desired contact forces. In addition, a computer vision program is developed to track the latest positions and orientations of the robots, the object, and the obstacles.

In the high-level coordination layer, the multi-agent architecture and the learning/evolution-based algorithm are implemented to search for the optimal cooperation strategy so that the object is transported effectively while avoiding the obstacles in the environment.

8. Experiments and results

Two experiments are designed to test the multi-agent machine-learning strategy and the hybrid force/position control scheme, which have been presented in the previous sections.

8.1. Testing the hybrid force/position control scheme

In this experiment, the two robots are required to move to a specified location with a specified orientation, and to cooperatively transport the box for 8 s while maintaining a constant contact force. This process is shown in Fig. 14(a)–(c).

According to the control scheme given in Section 5, the process is divided into two stages: position control and hybrid force/position control. The former is used to move the robots to a desired position with a desired orientation. The latter is employed to exert a constant force in a specific direction while the position is kept unchanged in the other directions. Fig. 15(a)–(d) shows the control process of the robotic arm in the first stage.

Fig. 15. Position control of the robotic arm. (a) The trajectory of the end effector. (b) The x coordinate of the end effector. (c) The y coordinate of the end effector. (d) The speed of the end effector.

Fig. 16(a) and (b) shows the force curve and the control input of the robotic arm under the hybrid force/position scheme in the second stage.

Fig. 16. The force control of the robotic arm. (a) The force on the end effector. (b) The control input of the joint.

From Figs. 14–16 it is observed that the hybrid force/position control scheme and the two-step control strategy, which are presented in Section 5, perform properly in the experiments.

8.2. Testing the multi-agent learning strategy

This experiment is designed to test the multi-agent learning strategy described in Section 4. In the experiment, a removable obstacle is suddenly placed in the forward path while the robots are cooperatively transporting the box to the goal location. Fig. 17 shows how the robots handle this task.

From Fig. 17 it is seen that when the robots detect the existence of a removable obstacle in the forward path (this information is given by the vision agent), they select the "sweeping" action. In the "sweeping" process, the two robots are assigned different roles, where the mobile robot soon leaves the box, moves towards the obstacle, and pushes it away from the path. While the mobile robot is sweeping the obstacle, the robotic arm just stays in the same place, waiting for its peer to return. Once the vision agent reports that the obstacle has been swept away, the mobile robot returns and continues the task of box transportation in cooperation with the robotic arm.

This experiment demonstrates that the multi-agent robotic system developed in the present work, equipped with a machine-learning capability, exhibits good flexibility and adaptivity in complex environments, while effectively carrying out an intended task.

Fig. 17. The robots selected the ‘‘sweeping’’ action to remove an obstacle in the path. (a) Initial state (pushing of the box). (b) Leaving the box.
(c) Removing the obstacle. (d) Returning.

9. Conclusions

This paper presents a multi-robot object transportation system developed by the authors in the IAL. A new multi-agent architecture was designed for this purpose, which integrated RL and GAs, together with a robotic hybrid force/position control scheme. A modified Q-learning algorithm called the sequential Q-learning algorithm was developed in this paper. The empirical results showed that the algorithm was particularly suitable for decision making in multi-robot systems with resource or behavior conflicts. The details of these developments and implementation in the object transportation system were presented. Computer simulations using the Java language and experimentation using the developed prototype were used to demonstrate the feasibility and the effectiveness of the system.

Acknowledgments

Funding for the work reported in this paper has come from the National Science and Engineering Research Council (NSERC) of Canada, the Canada Foundation for Innovation (CFI), and the K.C. Wong Education Foundation of Hong Kong.

References

Arai, T., Pagello, E., Parker, L.E. (Eds.), 2002. Guest editorial: advances in multirobot systems. IEEE Transactions on Robotics and Automation 18 (5), 655–661.
Astrom, K.J., Wittenmark, B., 1994. Adaptive Control, second ed. Addison-Wesley, Menlo Park, CA.
Cao, Y.U., Fukunaga, A.S., Kahng, A.B., 1997. Cooperative mobile robotics: antecedents and directions. Autonomous Robots 4 (1), 7–27.
Craig, J.J., 2005. Introduction to Robotics: Mechanics and Control. Pearson Prentice-Hall, Upper Saddle River, NJ.
Ferch, M., Zhang, J., 2002. Learning cooperative grasping with the graph representation of a state–action space. Robotics and Autonomous Systems 38, 183–195.
Huntsberger, T., Pirjanian, P., et al., 2003. CAMPOUT: a control architecture for tightly coupled coordination of multirobot systems for planetary surface exploration. IEEE Transactions on Systems, Man and Cybernetics—Part A 33 (5), 550–559.
Inoue, Y., Tohge, T., Iba, H., 2004. Object transportation by two humanoid robots using cooperative learning. In: Proceedings of the Congress on Evolutionary Computation (CEC 2004), Portland, OR, 2004, pp. 1201–1208.
Ito, K., Gofuku, A., 2004. Hybrid autonomous control for multi mobile robots. Advanced Robotics 18 (1), 83–99.
Jones, C., Mataric, M.J., 2004. Automatic synthesis of communication-based coordinated multi-robot systems. In: Proceedings of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, 2004, pp. 381–387.
Karray, F.O., de Silva, C.W., 2004. Soft Computing and Intelligent Systems Design. Pearson Addison Wesley, New York.
Kumar, M., Garg, D.P., 2004. Sensor-based estimation and control of forces and moments in multiple cooperative robots. Journal of Dynamic Systems, Measurement, and Control, Transactions of the ASME 126, 276–283.
Littman, M.L., 2001. Value-function reinforcement learning in Markov games. Journal of Cognitive Systems Research 2 (1), 55–66.
Liu, J., Wu, J., 2001. Multi-Agent Robotic Systems. CRC Press, Boca Raton, FL.
Martinson, E., Arkin, R.C., 2003. Learning to role-switch in multi-robot systems. In: Proceedings of the 2003 IEEE International Conference on Robotics and Automation, Taipei, Taiwan, 2003, pp. 2727–2734.
484 Y. Wang, C.W. de Silva / Engineering Applications of Artificial Intelligence 21 (2008) 470–484

International Conference on Intelligent Robots and Systems, EPFL, Sugar, T., Desai, J.P., Kumar, V., et al., 2001. Coordination of multiple
Lausanne, Switzerland, 2002, pp. 970–977. mobile manipulators. In: Proceedings of the 2001 IEEE International
Mataric, M.J., 1997. Reinforcement learning in the multi-robot domain. Conference on Robotics and Automation, Seoul, Korea, 2001,
Autonomous Robots 4 (1), 73–83. pp. 3022–3027.
Mitchell, T.M., 1997. Machine Learning. McGraw-Hill, New York. Sugar, T.G., Kumar, V., 2002. Control of cooperating mobile manip-
Miyata, N., Ota, J., Arai, T., Asama, H., 2002. Cooperative transport by ulators. IEEE Transactions on Robotics and Automation 18 (1), 94.
multiple mobile robots in unknown static environments associated Wang, Y., de Silva, C.W., 2005. An object transportation system
with real-time task assignment. IEEE Transactions on Robotics and with multiple robots and machine learning. In: Proceedings of the
Automation 18 (10), 769–780. 2005 American Control Conference (ACC 2005), Portland, OR,
Nolfi, S., Floreano, D., 1999. Learning and evolution. Autonomous pp. 1371–1376.
Robots 7 (1), 89–113. Wang, Y., de Silva, C.W., 2006a. Multi-robot box-pushing: single-agent
Parker, L.E., 1998. ALLIANCE: an architecture for fault tolerant Q-learning vs. team Q-learning. In: Proceedings of the 2006 IEEE/RSJ
multirobot cooperation. IEEE Transactions on Robotics and Auto- International Conference on Intelligent Robots and Systems (IROS),
mation 14 (2), 220–240. Beijing, China, pp. 3694–3699.
Parker, L.E., 2000. Lifelong adaptation in heterogeneous multi-robot Wang, Y., de Silva, C.W., 2006b. Extend single-agent reinforcement
teams: response to continual variation in individual robot perfor- learning approach to a multi-robot cooperative task in an unknown
mance. Autonomous Robots 8 (2000), 239–267. dynamic environment. In: Proceedings of the IEEE 2006 International
Pereira, G., Campos, M., Kumar, V., 2004. Decentralized algorithms for Joint Conference on Neural Networks (IJCNN), Vancouver, Canada,
multi-robot manipulation via caging. The International Journal of pp. 10098–10104.
Robotics Research 23 (7–8), 783–795. Wang, Z., Takano, Y., Hirata, Y., et al., 2004. A pushing leader based
Rus, D., Donald, B., Jennings, J., 1995. Moving furniture with teams of decentralized control method for cooperative object transportation. In:
autonomous robots. In: Proceedings of the IEEE/RSJ International Proceedings of the 2004 IEEE/RSJ International Conference on
Conference on Human Robot Interaction and Cooperative Robots, Intelligent Robots and Systems, Sendai, Japan, pp. 1035–1040.
Pittsburgh, PA, 1995, pp. 235–242. Wang, Z., Hirata, Y., Kosuge, K., 2005. An algorithm for testing object
Schenker, P.L., Huntsberger, T., et al., 2003. Planetary rover develop- caging condition by multiple mobile robots. In: Proceedings of the
ments supporting MARS exploration, sample return and future 2005 IEEE/RSJ International Conference on Intelligent Robots and
human–robotic colonization. Autonomous Robots 14 (2003), 103–126. Systems, Edmont, Alta., Canada, pp. 2664–2669.
Stone, P., Veloso, M., 1998. Layered approach to learning client behaviors Yamada, S., Saito, J., 2001. Adaptive action selection without explicit
in the ROBOCUP soccer server. Applied Artificial Intelligence 12 communication for multi-robot box-pushing. IEEE Transactions on
(2–3), 165–188. Systems, Man and Cybernetics 31 (3), 398–404.
Stone, P., Veloso, M., 2000. Multiagent systems: a survey from a machine Yamashita, A., Arai, T., Ota, J., et al., 2003. Motion planning of multiple
learning perspective. Autonomous Robots 8 (3), 345–383. mobile robots for cooperative manipulation and transportation. IEEE
Stroupe, A., Huntsberger, T., et al., 2005. Behavior-based multi-robot Transactions on Robotics and Automation 19 (2), 223–237.
collaboration for autonomous construction tasks. In: Proceedings of Yang, E., Gu, D., 2004. Multiagent reinforcement learning for multi-robot
the 2005 IEEE/RSJ International Conference on Intelligent Robots systems: a survey. Technical Report. /http://robotics.usc.edu/maja/
and Systems, Edmont, Alta., Canada, 2005, pp. 1989–1994. teaching/cs584/papers/yang04multiagent.pdfS.
