Abstract

Researchers in the field of Distributed Artificial Intelligence (DAI) have been developing efficient mechanisms to coordinate the activities of multiple autonomous agents. The need for coordination arises because agents have to share resources and expertise required to achieve their goals. Previous work in the area includes using sophisticated information exchange protocols, investigating heuristics for negotiation, and developing formal models of possibilities of conflict and cooperation among agent interests. In order to handle the changing requirements of continuous and dynamic environments, we propose learning as a means to provide additional possibilities for effective coordination. We use reinforcement learning techniques on a block pushing problem to show that agents can learn complementary policies to follow a desired path without any knowledge about each other. We theoretically analyze and experimentally verify the effects of learning rate on system convergence, and demonstrate the benefits of using learned coordination knowledge on similar problems. Reinforcement learning based coordination can be achieved in both cooperative and non-cooperative domains, and in domains with noisy communication channels and other stochastic characteristics that present a formidable challenge to using other coordination schemes.

Introduction

In this paper, we apply recent research developments from the reinforcement learning literature to the coordination problem in multiagent systems. In a reinforcement learning scenario, an agent chooses actions based on its perceptions, receives scalar feedback based on past actions, and is expected to develop a mapping from perceptions to actions that will maximize feedback. Multiagent systems are a particular type of distributed AI system (Bond & Gasser 1988), in which autonomous intelligent agents inhabit a world with no global control or globally consistent knowledge. These agents may still need to coordinate their activities with others to achieve their own local goals. They could benefit from receiving information about what others are doing or plan to do, and from sending them information to influence what they do.

Coordination of problem solvers, both selfish and cooperative, is a key issue in the design of an effective distributed AI system. The search for domain-independent coordination mechanisms has yielded some very different, yet effective, classes of coordination schemes. Almost all of the coordination schemes developed to date assume explicit or implicit sharing of information. In the explicit form of information sharing, agents communicate partial results (Durfee & Lesser 1991), speech acts (Cohen & Perrault 1979), resource availabilities (Smith 1980), etc. to other agents to facilitate the process of coordination. In the implicit form of information sharing, agents use knowledge about the capabilities of other agents (Fox 1981; Genesereth, Ginsberg, & Rosenschein 1986) to aid local decision-making. Though each of these approaches has its own benefits and weaknesses, we believe that the less an agent depends on shared information, and the more flexible it is to the on-line arrival of problem-solving and coordination knowledge, the better it can adapt to changing environments.

In this paper, we discuss how reinforcement learning techniques for developing policies that optimize environmental feedback, through a mapping between perceptions and actions, can be used by multiple agents to learn coordination strategies without having to rely on shared information. These agents, though working in a common environment, are unaware of the capabilities of other agents and may or may not be cognizant of the goals to achieve. We show that through repeated problem-solving experience, these agents can develop policies to maximize environmental feedback that can be interpreted as goal achievement from the viewpoint of an external observer. This research opens up a new dimension of coordination strategies for multiagent systems.

Acquiring coordination knowledge
Researchers in the field of machine learning have investigated a number of schemes for using past experience to improve problem-solving behavior (Shavlik & Dietterich 1990). A number of these schemes can be effectively used to aid the problem of coordinating multiple agents inhabiting a common environment. In cooperative domains, where agents have approximate models of the behavior of other agents and are willing to reveal information to enable the group to perform better as a whole, pre-existing domain knowledge can be used inductively to improve performance over time. On the other hand, learning techniques that can be used incrementally to develop problem-solving skills relying on little or no pre-existing domain knowledge can be used by both cooperative and non-cooperative agents. Though the latter form of learning may be more time-consuming, it is generally more robust in the presence of noisy, uncertain, and incomplete information.

Previous proposals for using learning techniques to coordinate multiple agents have mostly relied on using prior knowledge (Brazdil et al. 1991), or on cooperative domains with unrestricted information sharing (Sian 1991). Even previous work on using reinforcement learning for coordinating multiple agents (Tan 1993; Weiß 1993) has relied on explicit information sharing. We, however, concentrate on systems where agents share no problem-solving knowledge. We show that although each agent is independently optimizing its own environmental reward, global coordination between multiple agents can emerge without explicit or implicit information sharing. These agents can therefore act independently and autonomously, without being affected by communication delays (due to other agents being busy) or by the failure of a key agent (who controls information exchange or who has more information), and do not have to worry about the reliability of the information received (Do I believe the information received? Is the communicating agent an accomplice or an adversary?). The resultant systems are, therefore, robust and general-purpose.

Reinforcement learning

In reinforcement learning problems (Barto, Sutton, & Watkins 1989; Holland 1986; Sutton 1990), reactive and adaptive agents are given a description of the current state and have to choose the next action from a set of possible actions so as to maximize a scalar reinforcement or feedback received after each action. The learner's environment can be modeled by a discrete-time, finite-state Markov decision process that can be represented by a 4-tuple (S, A, P, r), where P : S × S × A → [0, 1] gives the probability of moving from state s1 to state s2 on performing action a, and r : S × A → ℝ is a scalar reward function. Each agent maintains a policy, π, that maps the current state into the desirable action(s) to be performed in that state. The expected value of a discounted sum of future rewards of a policy π at a state x is given by V_γ^π(x) ≡ E{ Σ_{t=0..∞} γ^t r_t^π }, where r_t^π is the random variable corresponding to the reward received by the learning agent t time steps after it starts using the policy π in state x, and γ is a discount rate (0 < γ < 1).

Various reinforcement learning strategies have been proposed with which agents can develop a policy to maximize rewards accumulated over time. For our experiments, we use the Q-learning algorithm (Watkins 1989), which is designed to find a policy π* that maximizes V_γ^{π*}(s) for all states s ∈ S. The decision policy is represented by a function Q : S × A → ℝ, which estimates long-term discounted rewards for each state-action pair. The Q values are defined as Q_γ^π(s, a) = V_γ^{a;π}(s), where a;π denotes the event sequence of choosing action a at the current state, followed by choosing actions based on policy π. The action a to perform in a state s is chosen such that it is expected to maximize the reward,

    V_γ^{π*}(s) = max_{a ∈ A} Q_γ^{π*}(s, a)    for all s ∈ S.

If an action a in state s produces a reinforcement of R and a transition to state s', then the corresponding Q value is modified as follows:

    Q(s, a) ← (1 − β) Q(s, a) + β (R + γ max_{a'} Q(s', a')).    (1)
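For concreteness, the following is a minimal sketch of a tabular Q-learner implementing the update of Equation 1. It is illustrative only, not the code used in the experiments; the default parameter values follow those reported later in the paper (β = 0.2, γ = 0.9, initial Q-value 100), and the finite, discretized action set is assumed to be supplied by the caller.

    from collections import defaultdict

    class QLearner:
        """Tabular Q-learner following the update rule of Equation 1."""

        def __init__(self, actions, beta=0.2, gamma=0.9, q_init=100.0):
            self.actions = list(actions)   # finite, discretized action set
            self.beta = beta               # learning rate (beta in Equation 1)
            self.gamma = gamma             # discount rate
            # Optimistically initialized Q-table: unvisited (state, action)
            # pairs start at q_init, which encourages exploration.
            self.q = defaultdict(lambda: q_init)

        def choose_action(self, state):
            # Greedy policy: pick the action with the largest estimated Q-value.
            return max(self.actions, key=lambda a: self.q[(state, a)])

        def update(self, state, action, reward, next_state):
            # Q(s,a) <- (1-beta) Q(s,a) + beta (R + gamma max_a' Q(s',a'))
            best_next = max(self.q[(next_state, a)] for a in self.actions)
            self.q[(state, action)] = ((1 - self.beta) * self.q[(state, action)]
                                       + self.beta * (reward + self.gamma * best_next))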
Block pushing problem

To explore the application of reinforcement learning in multi-agent environments, we designed a problem in which two agents, a1 and a2, are independently assigned to move a block, b, from a starting position, S, to some goal position, G, following a path, P, in Euclidean space. The agents are not aware of the capabilities of each other and yet must choose their actions individually such that the joint task is completed. The agents have no knowledge of the system physics, but can perceive their current distance from the desired path to the goal state. Their actions are restricted as follows: agent i exerts a force F_i, where 0 ≤ |F_i| ≤ F_max, on the object at an angle θ_i, where 0 ≤ θ_i ≤ π. An agent pushing with force F at angle θ will offset the block in the x direction by |F| cos(θ) units and in the y direction by |F| sin(θ) units. The net resultant force on the block is found by vector addition of the individual forces: F = F_1 + F_2. We calculate the new position of the block by assuming unit displacement per unit force along the direction of the resultant force. The new block location is used to provide feedback to the agents. If (x, y) is the new block location, P_x(y) is the x-coordinate of the path P for the same y coordinate, and Δx = |x − P_x(y)| is the distance along the x dimension between the block and the desired path, then K · a^(−Δx) is the feedback given to each agent for their last action (we have used K = 50 and a = 1.15).

The field of play is restricted to a rectangle with endpoints [0, 0] and [100, 100]. A trial consists of the agents starting from the initial position S and applying forces until either the goal position G is reached or the block leaves the field of play (see Figure 1). We abort a trial if a pre-set number of agent actions fail to take the block to the goal position.
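As an illustration of the dynamics and feedback just described, here is a minimal simulation sketch. It is not the implementation used for the experiments: the vertical path x = 40 is assumed (matching the first problem reported below), and the function and constant names are placeholders.

    import math

    K, A_FB = 50.0, 1.15           # feedback constants K and a from the text
    FIELD_MIN, FIELD_MAX = 0.0, 100.0

    def path_x(y):
        # x-coordinate of the desired path P for a given y (assumed: x = 40).
        return 40.0

    def step(block, forces):
        """One time step: apply the agents' (magnitude, angle) force pairs.

        Returns the new block position and the scalar feedback that is
        given to every agent for this joint action.
        """
        # Net force is the vector sum of the individual forces.
        fx = sum(f * math.cos(theta) for f, theta in forces)
        fy = sum(f * math.sin(theta) for f, theta in forces)
        # Unit displacement per unit force along the resultant direction.
        x, y = block[0] + fx, block[1] + fy
        # Feedback decays with the block's x-distance from the desired path;
        # it is maximal (= K) when the block lies exactly on the path.
        dx = abs(x - path_x(y))
        feedback = K * A_FB ** (-dx)
        return (x, y), feedback

    def in_field(pos):
        return FIELD_MIN <= pos[0] <= FIELD_MAX and FIELD_MIN <= pos[1] <= FIELD_MAX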
[Figure 1: the block pushing problem. Labels: field of play; agent 1; at each time step each agent pushes with some force at a certain angle; combined effect; F sin θ; F.]

Figure 2: The X/Motif interface for experimentation.
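Putting the earlier sketches together, one trial of the kind described above could be driven by a loop such as the following. This is again only a sketch reusing the QLearner, step, and in_field functions sketched earlier; the state abstraction, the discretized action set, and the trial cut-off are illustrative assumptions rather than the exact settings used in the experiments.

    import math

    START, GOAL = (40.0, 0.0), (40.0, 100.0)   # first problem reported below
    MAX_ACTIONS = 200                           # illustrative pre-set trial cut-off

    # Illustrative discretized actions: two force magnitudes, 11 angles in [0, pi].
    ACTIONS = [(f, k * math.pi / 10) for f in (1.0, 2.0) for k in range(11)]

    def discretize(pos, cell=10.0):
        # Illustrative state abstraction: the grid cell containing the block.
        return (int(pos[0] // cell), int(pos[1] // cell))

    def run_trial(agent1, agent2):
        """One trial: both agents act and learn independently; they never
        exchange information, only observe the block and the shared feedback."""
        block = START
        for _ in range(MAX_ACTIONS):
            state = discretize(block)
            act1, act2 = agent1.choose_action(state), agent2.choose_action(state)
            block, feedback = step(block, [act1, act2])
            nxt = discretize(block)
            agent1.update(state, act1, feedback, nxt)
            agent2.update(state, act2, feedback, nxt)
            if not in_field(block) or math.dist(block, GOAL) < 1.0:
                break
        return block

    # Usage: two independent learners trained over repeated trials.
    # a1, a2 = QLearner(ACTIONS), QLearner(ACTIONS)
    # for _ in range(1000):
    #     run_trial(a1, a2)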
Solving for Q_ss in this equation yields a value of K/(1 − γ). In order for the agents to explore all actions after the Q-values are initialized at S_I, we require that any new Q value be less than S_I. From similar considerations as above, we can show that this will be the case if S_I ≥ K/(1 − γ). In our experiments we fix the maximum reward K at 50, S_I at 100, and γ at 0.9. Unless otherwise mentioned, we have used β = 0.2, and allowed each agent to vary both the magnitude and the angle of the force it applies on the block.

The first problem we used had starting and goal positions at (40, 0) and (40, 100) respectively. During our initial experiments we found that with an even number of discrete intervals chosen for the angle dimension, an agent cannot push along any line parallel to the y-axis. Hence we used an odd number, 11, of discrete intervals for the angle dimension. The number of discrete intervals ...
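One way to see why an odd number of intervals is needed (assuming, for illustration, that each discrete interval is represented by its midpoint): with an odd number of equal intervals over [0, π], one midpoint falls exactly on π/2, a push parallel to the y-axis, whereas with an even number none does.

    import math

    def angle_midpoints(n):
        # Midpoints of n equal intervals covering [0, pi].
        return [(k + 0.5) * math.pi / n for k in range(n)]

    # 11 intervals: the 6th midpoint is exactly pi/2 (straight towards the goal).
    print(any(math.isclose(m, math.pi / 2) for m in angle_midpoints(11)))  # True
    # 10 intervals: no midpoint equals pi/2, so the block can never be pushed
    # straight along a line parallel to the y-axis.
    print(any(math.isclose(m, math.pi / 2) for m in angle_midpoints(10)))  # False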
... should be proportional to the number of trials to convergence for a given run.

In the following, let S_t be the Q-value after t updates and S_I be the initial Q-value. Using Equation 1 and the fact that, for the optimal action at the starting position, the reinforcement received is K and the next state is the same as the current state, we can write

    S_{t+1} = (1 − β) S_t + β (K + γ S_t)
            = (1 − β(1 − γ)) S_t + βK
            = A S_t + C,    (2)

where A and C are constants equal to 1 − β(1 − γ) and βK respectively. Equation 2 is a difference equation which can be solved using S_0 = S_I to obtain

    S_t = A^t S_I + C (1 − A^t) / (1 − A).
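The closed form makes the role of the learning rate explicit: the gap between S_t and the fixed point K/(1 − γ) of Equation 2 shrinks by the factor A = 1 − β(1 − γ) per update, so larger β means faster convergence. A small sketch of this calculation (our own, with an illustrative tolerance and the parameter values given above):

    import math

    def updates_to_converge(beta, gamma=0.9, K=50.0, S_I=100.0, tol=1.0):
        """Updates until S_t is within tol of the fixed point of Equation 2.

        Uses S_t - K/(1-gamma) = A**t * (S_I - K/(1-gamma)),
        with A = 1 - beta * (1 - gamma).
        """
        A = 1.0 - beta * (1.0 - gamma)
        s_inf = K / (1.0 - gamma)
        return math.ceil(math.log(tol / abs(S_I - s_inf)) / math.log(A))

    for beta in (0.1, 0.2, 0.4, 0.8):
        print(beta, updates_to_converge(beta))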
[Figure: Varying beta.]
... of the possible states in the world. The action choice for each agent at a state is represented by a straight line at the appropriate angle, scaled to represent the magnitude of force. We immediately notice that the individual policies are complementary rather than identical. Given a state, the combination of the best actions will bring the block closer to the desired path. In some cases, one of the agents even pushes in the wrong direction while the other agent has to compensate with a larger force to bring the block closer to the desired path. These cases occur in states which are at the edge of the field of play and have been visited only infrequently. Complementarity of the individual policies, however, is visible for all the states.

Conclusions

In this paper, we have demonstrated that two agents can coordinate to solve a problem better, even without a model of each other, than they can alone. We have developed and experimentally verified theoretical predictions of the effects of a particular system parameter, the learning rate, on system convergence. Other experiments show the utility of using knowledge acquired from learning in one situation in other, similar situations. Additionally, we have demonstrated that agents coordinate by learning complementary, rather than identical, problem-solving knowledge.

The most surprising result of this paper is that agents can learn coordinated actions without even being aware of each other! This is a clear demonstration of the fact that more complex system behavior can emerge out of relatively simple properties of the components of the system. Since agents can learn to coordinate behavior without sharing information, this methodology can be applied equally to cooperative and non-cooperative domains.

To converge on the optimal policy, agents must repeatedly perform the same task. This aspect of the current approach to agent coordination limits its applicability. Without an appropriate choice of system parameters, the system may take considerable time to converge, or may not converge at all.

In general, agents converged on sub-optimal policies due to incomplete exploration of the state space. We plan to use Boltzmann selection of actions in place of deterministic action choice to remedy this problem, though this will lead to slower convergence. We also plan to develop mechanisms that incorporate world models to speed up reinforcement learning, as proposed by Sutton (Sutton 1990). We are currently investigating the application of reinforcement learning to resource-sharing problems involving non-benevolent agents.

References

A. G. Barto, R. S. Sutton, and C. Watkins. Sequential decision problems and neural networks. In Proceedings of the 1989 Conference on Neural Information Processing, 1989.

A. H. Bond and L. Gasser. Readings in Distributed Artificial Intelligence. Morgan Kaufmann Publishers, San Mateo, CA, 1988.

P. Brazdil, M. Gams, S. Sian, L. Torgo, and W. van de Velde. Learning in distributed systems and multi-agent environments. In European Working Session on Learning, Lecture Notes in AI 482, Berlin, March 1991. Springer-Verlag.

P. R. Cohen and C. R. Perrault. Elements of a plan-based theory of speech acts. Cognitive Science, 3(3):177-212, 1979.

E. H. Durfee and V. R. Lesser. Partial global planning: A coordination framework for distributed hypothesis formation. IEEE Transactions on Systems, Man, and Cybernetics, 21(5), September 1991.

M. S. Fox. An organizational view of distributed systems. IEEE Transactions on Systems, Man, and Cybernetics, 11(1):70-80, January 1981.

M. Genesereth, M. Ginsberg, and J. Rosenschein. Cooperation without communication. In Proceedings of the National Conference on Artificial Intelligence, pages 51-57, Philadelphia, Pennsylvania, 1986.

J. H. Holland. Escaping brittleness: the possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R. Michalski, J. Carbonell, and T. M. Mitchell, editors, Machine Learning, an Artificial Intelligence Approach: Volume II. Morgan Kaufmann, Los Altos, CA, 1986.

J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, San Mateo, California, 1990.

S. Sian. Adaptation based on cooperative learning in multi-agent systems. In Y. Demazeau and J.-P. Müller, editors, Decentralized AI, volume 2, pages 257-272. Elsevier Science Publications, 1991.

R. G. Smith. The contract net protocol: High-level communication and control in a distributed problem solver. IEEE Transactions on Computers, C-29(12):1104-1113, December 1980.

R. S. Sutton. Integrated architecture for learning, planning, and reacting based on approximate dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216-225, 1990.

M. Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330-337, June 1993.

C. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge University, 1989.

G. Weiß. Learning to coordinate actions in multi-agent systems. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 311-316, August 1993.