[Figure 1: Hex network (Net A and Net B), with load-to-cost functions V1(x) = 10x, V2(x) = 50 + x, and V3(x) = 10 + x.]

However consider what happens when six travelers are on the highways in net B. If an ISPA is used to make each routing decision, then at equilibrium each of the three possible paths contains two travelers.¹ Due to overlaps in the paths however, this results in each traveler accruing a cost of 92 units, which is higher than what they accrued before the new highway was built. The net effect of adding a new road is to increase the cost incurred by every traveler.

¹ We have in mind here the Nash equilibrium, where no traveler can gain by changing strategies (Fudenberg & Tirole 1991).

The COIN Formalism

One common solution to side-effect problems is to have certain components of the network (e.g., a "network manager" (Korilis, Lazar, & Orda 1995)) dictate actions to other [...]

[...] the state of agent η at time t, with ζ_{,t} the state of all agents at time t [...] (for non-time-varying utilities). World utility, G(ζ), is an arbitrary function of the state of all agents across all time. (Note that that state is a Euclidean vector.) When η is an agent that uses a machine learning algorithm to "try to increase" its private utility, we write that private utility as g_η(ζ) or, more generally, to allow that utility to vary in time, g_{η,t}(ζ).

Here we focus on the case where our goal, as COIN designers, is to maximize world utility through the proper selection of private utility functions. Intuitively, the idea is to choose private utilities that are aligned with world utility, and that also have the property that it is relatively easy for us to configure each agent so that the associated private utility achieves a large value.

We need a formal definition of the concept of having private utilities be "aligned" with G. Constructing such a formalization is a subtle exercise. For example, consider systems where the world utility is the sum of the private utilities of the individual nodes. This might seem a reasonable candidate for an example of "aligned" utilities. However such systems are examples of the more general class of systems that are "weakly trivial". It is well known that in weakly trivial systems each individual agent greedily trying to maximize its own utility can lead to the tragedy of the commons (Hardin 1968) and actually minimize G. In particular, this can be the case when private utilities are independent of time and G = Σ_η g_η. Evidently, at a minimum, having G = Σ_η g_η is not sufficient to ensure that we have "aligned" utilities; some alternative formalization of the concept is needed.
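The arithmetic behind the 92-unit equilibrium cost above can be checked with a short script. This is an illustrative sketch, not code from the paper: it assumes the standard Braess construction (two overlapping two-router paths, plus a third path created by the added V3 router), uses made-up router names, and the 83-unit no-link cost is derived from Figure 1's cost functions rather than quoted from the text.

```python
# Load-to-cost functions from Figure 1.
def V1(x): return 10 * x
def V2(x): return 50 + x
def V3(x): return 10 + x

# Each path is a list of (router_id, cost_function) hops; overlapping
# paths share router ids, and hence share load.
P1 = [("a", V1), ("b", V2)]                 # original left path
P2 = [("c", V2), ("d", V1)]                 # original right path
P3 = [("a", V1), ("e", V3), ("d", V1)]      # new path through the V3 router

def path_costs(assignment):
    """assignment: one path index (0, 1 or 2) per traveler.
    Returns the cost each traveler accrues on their chosen path."""
    paths = [P1, P2, P3]
    load = {}
    for p in assignment:
        for router, _ in paths[p]:
            load[router] = load.get(router, 0) + 1
    return [sum(V(load[r]) for r, V in paths[p]) for p in assignment]

# Net A (no V3 link): six travelers split 3/3 over P1 and P2 -> 83 each.
print(path_costs([0, 0, 0, 1, 1, 1]))
# Net B: the equilibrium puts two travelers on each path -> 92 each,
# matching the per-traveler cost quoted in the text.
print(path_costs([0, 0, 1, 1, 2, 2]))
```

Adding the link raises every traveler's equilibrium cost from 83 to 92, which is exactly the paradox described above.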
A more careful formalization of the notion of aligned utilities is the concept of "factored" systems. A system is factored at time τ when the following holds for each agent η individually: a change at time τ to the state of η alone, when propagated across time, will result in an increased value of g_{η,τ}(ζ) if and only if it results in an increase for G (Wolpert & Tumer 2000b).

For a factored system, the side-effects of any change to η's t = τ state that increases its private utility cannot decrease world utility. There are no restrictions though on the effects of that change on the private utilities of other agents and/or times. In particular, we don't preclude an agent's algorithm at two different times from "working at cross-purposes" to each other, so long as at both moments the agent is working to improve G. In game-theoretic terms, in factored systems optimal global behavior corresponds to the agents' always being in a private utility Nash equilibrium (Fudenberg & Tirole 1991). In this sense, there can be no TOC for a factored system. As a trivial example, a system is factored for g_{η,τ} = G ∀η, τ.

Define the effect set of the agent-time pair (η, τ) at ζ, C^eff_{(η,τ)}(ζ), as the set of all components ζ_{η′,t} which under the forward dynamics of the system have non-zero partial derivative with respect to the state of agent η at t = τ. Intuitively, (η, τ)'s effect set is the set of all components ζ_{η′,t} which would be affected by a change in the state of agent η at time τ. (They may or may not be affected by changes in the t = τ states of the other agents.)

Next, for any set σ of components (η′, t), define CL_σ(ζ) as the "virtual" vector formed by clamping the components of ζ delineated in σ to an arbitrary fixed value. (In this paper, we take that fixed value to be 0 for all components listed in σ.) The value of the effect set wonderful life utility (WLU for short) for σ is defined as:

    WLU_σ(ζ) ≡ G(ζ) − G(CL_σ(ζ)).    (1)

In particular, we are interested in the WLU for the effect set of agent-time pair (η, τ). This WLU is the difference between the actual world utility and the virtual world utility where all agent-time pairs that are affected by (η, τ) have been clamped to a zero state while the rest of ζ is left unchanged.

Since we are clamping to ~0, we can loosely view (η, τ)'s effect set WLU as analogous to the change in world utility that would have arisen if (η, τ) "had never existed". (Hence the name of this utility; cf. the Frank Capra movie.) Note however, that CL is a purely "fictional", counter-factual operator, in that it produces a new ζ without taking into account the system's dynamics. The sequence of states the agent-time pairs in σ are clamped to in constructing the WLU need not be consistent with the dynamical laws of the system. This dynamics-independence is a crucial strength of the WLU. It means that to evaluate the WLU we do not try to infer how the system would have evolved if agent η's state were set to ~0 at time τ and the system evolved from there. So long as we know ζ extending over all time, and the function G, we know the value of the WLU.

If our system is factored with respect to private utilities {g_{η,τ}}, we want each agent to be in a state at time τ that induces as high a value of the associated private utility as possible (given the initial states of the other agents). Regardless of the system dynamics, having g_{η,τ} = G ∀η means the system is factored at time τ. It is also true that, regardless of the dynamics, g_{η,τ} = WLU_{C^eff_{(η,τ)}} ∀η gives a factored system at time τ (proof in (Wolpert & Tumer 2000b)). However, note that since each agent is operating in a large system, it may experience difficulty discerning the effects of its actions on G when G sensitively depends on all the myriad components of the system. Therefore each η may have difficulty learning from past experience what to do to achieve high g_{η,τ} when g_{η,τ} = G.²

² In particular, in routing in large networks, having private rewards given by the world reward functions means that to provide each router with its reward at each time step we need to provide it the full throughput of the entire network at that step. This is usually infeasible in practice. Even if it weren't though, using these private utilities would mean that the routers face a very difficult task in trying to discern the effect of their actions on their rewards, and therefore would likely be unable to learn their best routing strategies.

This problem can be mitigated by using effect set WLU as the private utility, since the subtraction of the clamped term removes much of the "noise" of the activity of other agents, leaving only the underlying "signal" of how the agent in question affects the utility. (This reasoning is formalized as the concept of "learnability" in (Wolpert & Tumer 2000b).) Accordingly, one would expect that setting private utilities to WLU's ought to result in better performance than having g_{η,τ} = G ∀η, τ.

Simulation Overview

In this section we describe the model used in our simulations. We then present the ISPA in terms of that model, and apply the concepts of COIN theory to that model to derive private utilities for each agent. Because these utilities are "factored" we expect that agents acting to improve their own utilities will also improve the global utility (overall throughput of the network). We end by describing a Memory Based (MB) machine learning algorithm that each agent uses to estimate the value that its private utility would have under the different candidate routing decisions. In the MB COIN algorithm, each agent uses this algorithm to make routing decisions aimed at maximizing its estimated utility.

Simulation Model

As in much of network analysis, in the model used in this paper, at any time step all traffic at a router is a set of pairs of integer-valued traffic amounts and associated ultimate destination tags (Bertsekas & Gallager 1992). At each such time step t, each router r sums the integer-valued components of its current traffic at that time step to get its instantaneous load. We write that load as z_r(t) ≡ Σ_d x_{r,d}(t), where the index d runs over ultimate destinations, and x_{r,d}(t) is the total traffic at time t going from r towards d.
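The traffic bookkeeping just described can be sketched in a few lines. This is an illustrative sketch with hypothetical helper names, not code from the paper: traffic at a router is a set of (amount, destination) pairs, and the instantaneous load z_r(t) is the sum of the integer amounts.

```python
from collections import defaultdict

def instantaneous_load(traffic):
    """z_r(t): sum of the integer traffic amounts currently at router r.
    traffic: iterable of (amount, destination) pairs."""
    return sum(amount for amount, _dest in traffic)

def load_by_destination(traffic):
    """x_{r,d}(t): total traffic at the router headed towards each d."""
    x = defaultdict(int)
    for amount, dest in traffic:
        x[dest] += amount
    return dict(x)

traffic_at_r = [(2, "D1"), (1, "D2"), (3, "D1")]
assert instantaneous_load(traffic_at_r) == 6               # z_r(t)
assert load_by_destination(traffic_at_r) == {"D1": 5, "D2": 1}
```

Note that z_r(t) equals the sum of the per-destination loads x_{r,d}(t), as in the identity above.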
After its instantaneous load at time t is evaluated, the router sends all its traffic to the next downstream routers, according to its routing algorithm. After all such routed traffic goes to those next downstream routers, the cycle repeats itself, until all traffic reaches its destination. In our simulations, for simplicity, traffic was only introduced into the system (at the source routers) at the beginning of successive disjoint waves of L consecutive time steps.

In a real network, the cost of traversing a router depends on "after-effects" of recent instantaneous loads, as well as the current instantaneous load. To simulate this effect, we use time-averaged values of the load at a router rather than instantaneous load to determine the cost a packet incurs in traversing that router. More formally, we define the router's windowed load, Z_r(t), as the running average of that router's load value over a window of the previous W timesteps:

    Z_r(t) = (1/W) Σ_{t′=t−W+1}^{t} z_r(t′) = Σ_{d′} X_{r,d′}(t),

where the value of X_{r,d}(t) is set by the dynamical law X_{r,d}(t) = (1/W) Σ_{t′=t−W+1}^{t} x_{r,d}(t′). (W is always set to an integer multiple of L.) The windowed load is the argument to a load-to-cost function, V(·), which provides the cost accrued at time t by each packet traversing the router at this timestep. That is, at time t, the cost for each packet to traverse router r is given by V(Z_r(t)). Different routers have different V, to reflect the fact that real networks have differences in router software and hardware (response time, queue length, processing speed, etc.). For simplicity, W is the same for all routers however. With these definitions, world utility is

    G(ζ) = Σ_{t,r} z_r(t) V_r(Z_r(t)).    (2)

Our equation for G explicitly demonstrates that, as claimed above, in our representation we can express G as a sum of rewards, Σ_t R_t(ζ_{,t}), where R_t can be written as a function of a pair of (r, d)-indexed vectors:

    R_t(x_{r,d}(t), X_{r,d}(t)) = Σ_{r,d} x_{r,d}(t) V_r(Σ_{d′} X_{r,d′}(t)).

Routing Algorithms

At time step t, ISPA has access to all the windowed loads at time step t − 1 (i.e., it has access to Z_r(t − 1) ∀r), and assumes that those values will remain the same at all times ≥ t. (Note that for large window sizes and times close to t, this assumption is arbitrarily accurate.) Using this assumption, in ISPA, each router sends packets along the path that it calculates will minimize the costs accumulated by its packets.
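The ISPA rule just described — freeze each router's per-packet cost at V_r(Z_r(t − 1)) and pick the path of minimum accumulated cost — can be sketched as a shortest-path search with node weights. This is an illustrative sketch under that reading, with made-up names, not the paper's implementation:

```python
import heapq

def ispa_path(links, node_cost, src, dst):
    """links: router -> list of downstream routers.
    node_cost: router -> V_r(Z_r(t-1)), assumed frozen for all future steps.
    Returns (total cost, path) of the minimum-cost route from src to dst."""
    # Dijkstra over node weights: a path's cost is the sum of the
    # per-packet costs of the routers it traverses.
    best = {src: node_cost.get(src, 0)}
    heap = [(best[src], src, [src])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt in links.get(node, []):
            c = cost + node_cost.get(nxt, 0)
            if c < best.get(nxt, float("inf")):
                best[nxt] = c
                heapq.heappush(heap, (c, nxt, path + [nxt]))
    return float("inf"), []

# Toy two-path network; costs echo V1(3) = 30 and V2(3) = 53 from Figure 1.
links = {"S": ["A", "B"], "A": ["D"], "B": ["D"], "D": []}
cost = {"S": 0, "A": 30, "B": 53, "D": 53}
print(ispa_path(links, cost, "S", "D"))     # -> (83, ['S', 'A', 'D'])
```

Because the t − 1 windowed loads are treated as static, every router solves an ordinary shortest-path problem; the Braess failures discussed later arise precisely because all routers react to the same frozen costs.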
We now apply the COIN formalism to the model described above to derive the idealized version of our COIN routing algorithm. First let us identify the agents η as individual pairs of routers and ultimate destinations. So ζ_{η,t} is the vector of traffic sent along all links exiting η's router, tagged for η's ultimate destination, at time t. Next, in order to compute WLUs we must estimate the associated effect sets.

In the results presented here, the effect set of an agent is estimated as all agents that share the same destination as that agent.³ Based on this effect set, the WLU for an agent η is given by the difference between the total cost accrued by all agents in the network and the cost accrued by agents when all agents sharing the same destination as η are "erased." More precisely, using Eq. 2, one can show that each agent η that shares a destination d will have the following effect set WLU:

    g_d(ζ) ≡ G(ζ) − G(CL_{C^eff}(ζ))
           = Σ_t Σ_r [ z_r(t) V_r(Z_r(t)) − Σ_{d′≠d} x_{r,d′}(t) V_r(Σ_{d″≠d} X_{r,d″}(t)) ].    (3)

³ Exact factoredness obtains so long as our estimated effect set contains the true effect set; set equality is not necessary.

Notice that the summand in Eq. 3 is computed at each router separately from information available to that router. Subsequently those summands can be propagated across the network and the associated g_d's "rolled up" in much the same way as routing table updates are propagated in current routing algorithms.

Unlike the ISPA, the MB COIN has only limited knowledge, and therefore must predict the WLU value that would result from each potential routing decision. More precisely, for each router-ultimate-destination pair, the associated agent estimates the map from windowed loads on all outgoing links (the inputs) to WLU-based reward (the outputs). This is done with a single-nearest-neighbor algorithm. Next, each router could send the packets along the path that results in outbound traffic with the best (estimated) reward. However to be conservative, in these experiments we instead had the router randomly select between that path and the path selected by the ISPA (described above).
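The per-router summand of Eq. 3 can be sketched directly from the definitions above. This is an illustrative sketch with hypothetical names (the cost function V in the example is made up), not code from the paper:

```python
def wlu_summand(x, X, V, d):
    """One router's contribution at one time step to g_d of Eq. 3:
    actual cost z_r(t) V(Z_r(t)) minus the cost with all traffic bound
    for destination d 'erased' (clamped to zero).
    x: dict destination -> x_{r,d}(t); X: dict destination -> X_{r,d}(t)."""
    z = sum(x.values())                                   # z_r(t)
    Z = sum(X.values())                                   # Z_r(t)
    z_cl = sum(v for dest, v in x.items() if dest != d)   # clamped load
    Z_cl = sum(v for dest, v in X.items() if dest != d)   # clamped window
    return z * V(Z) - z_cl * V(Z_cl)

V = lambda load: 10 * load          # hypothetical V1-style cost function
x = {"D1": 3, "D2": 2}
X = {"D1": 3, "D2": 2}
# Actual term: 5 * V(5) = 250; clamped term: 2 * V(2) = 40.
print(wlu_summand(x, X, V, "D1"))   # -> 210
```

Each such summand uses only quantities local to the router, which is what allows the g_d's to be "rolled up" across the network like routing-table updates.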
SIMULATION RESULTS

Based on the model and routing algorithms discussed above, we have performed simulations to compare the performance of ISPA and MB COIN. In all cases traffic was inserted into the network in a regular, non-stochastic manner at the sources. The results we report are averaged over 20 runs. We do not report error bars as they are all lower than 0.05. In both networks we present⁴, ISPA suffers from the Braess' paradox, whereas the MB COIN almost never falls prey to the paradox for those networks. For no networks we have investigated is the MB COIN significantly susceptible to Braess' paradox.

⁴ See (Wolpert & Tumer 2000a) for additional experiments.

Hex Network

In Table 1 we give full results for the network in Fig. 1. In Table 2 we report results for the same network but with load-to-cost functions which incorporate non-linearities that better represent real router characteristics. (Instances of Braess' paradox are shown in bold.)

Table 1: Per-packet costs on the Hex network (Net A: without the added link; Net B: with it).

    Load   Net   ISPA    MB COIN
    1      A     55.41   55.44
           B     20.69   20.69
    2      A     60.69   60.80
           B     41.10   41.10
    3      A     65.92   66.10
           B     61.39   59.19
    4      A     71.10   71.41
           B     81.61   69.88

For ISPA, although the per packet cost for loads of 1 and 2 drops drastically when the new link is added, the per packet cost increases for higher loads. The MB COIN on the other hand uses the new link efficiently. Notice that the MB COIN's performance is slightly worse than that of the ISPA in the absence of the additional link. This is caused [...]

Butterfly Network

[Figure 2: Butterfly network, with sources S1-S3, destinations D1 and D2, and routers with load-to-cost functions V1, V2, V3.]

The next network we investigate is shown in Figure 2. We now have three sources that have to route their packets to two destinations (packets originating at S1 go to D1, and packets originating at S2 or S3 go to D2). Initially the two halves of the network have minimal contact, but with the addition of the extra link two sources from the two halves of the network share a common router on their potential shortest path.

Table 3 presents results for uniform traffic through all three sources, and then results for asymmetric traffic. For the asymmetric traffic patterns, the added link causes a drop in performance for the ISPA, especially for low overall traffic levels. This is not true for the MB COIN. Notice also that in the high, asymmetric traffic regime, the ISPA performs significantly worse than the MB COIN even without the added link, showing that a bottleneck occurs on the right side of the network alone.

CONCLUSION

Collective Intelligence design is a framework for controlling decentralized multi-agent systems so as to achieve a global goal. In designing a COIN, the central issue is determining the private goals to be assigned to the individual agents. One wants to choose those goals so that the greedy pursuit of them by the associated agents leads to a globally desirable solution. We have summarized some of the theory of COIN design and derived a routing algorithm based on application of that theory to our simulation scenario. In our simulations, the COIN algorithm induced costs up to 23% lower than the idealized version of conventional algorithms, the ISPA. This was despite the ISPA's having access to more information
than the MB COIN. Furthermore the COIN-based algorithm avoided the Braess' paradoxes that seriously diminished the performance of the ISPA.

In the work presented here, the COIN-based algorithm had to overcome severe limitations. The estimation of the effect sets, used for determining the private goals of the agents, was exceedingly coarse. In addition, the learning algorithms used by the agents to pursue those goals were particularly simple-minded. That a COIN-based router with such serious limitations consistently outperformed an ideal shortest path algorithm demonstrates the strength of the proposed method.

References

Bass, T. 1992. Road to ruin. Discover 56–61.

Bertsekas, D., and Gallager, R. 1992. Data Networks. Englewood Cliffs, NJ: Prentice Hall.

Boyan, J. A., and Littman, M. 1994. Packet routing in dynamically changing networks: A reinforcement learning approach. In Advances in Neural Information Processing Systems - 6, 671–678. Morgan Kaufmann.

Cohen, J. E., and Jeffries, C. 1997. Congestion resulting from increased capacity in single-server queueing networks. IEEE/ACM Transactions on Networking 5(2):305–310.

Cohen, J. E., and Kelly, F. P. 1990. A paradox of congestion in a queuing network. Journal of Applied Probability 27:730–734.

Fudenberg, D., and Tirole, J. 1991. Game Theory. Cambridge, MA: MIT Press.

Hardin, G. 1968. The tragedy of the commons. Science 162:1243–1248.

Heusse, M.; Snyers, D.; Guerin, S.; and Kuntz, P. 1998. Adaptive agent-driven routing and load balancing in communication networks. Advances in Complex Systems 1:237–254.

Huhns, M. E., ed. 1987. Distributed Artificial Intelligence. London: Pitman.

Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996. Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4:237–285.

Korilis, Y. A.; Lazar, A. A.; and Orda, A. 1995. Architecting noncooperative networks. IEEE Journal on Selected Areas in Communications 13(8).

Korilis, Y. A.; Lazar, A. A.; and Orda, A. 1997. Achieving network optima using Stackelberg routing strategies. IEEE/ACM Transactions on Networking 5(1):161–173.

Sandholm, T., and Lesser, V. R. 1995. Issues in automated negotiations and electronic commerce: extending the contract net protocol. In Proceedings of the Second International Conference on Multi-Agent Systems, 328–335. AAAI Press.

Subramanian, D.; Druschel, P.; and Chen, J. 1997. Ants and reinforcement learning: A case study in routing in dynamic networks. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, 832–838.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3:9–44.

Watkins, C., and Dayan, P. 1992. Q-learning. Machine Learning 8(3/4):279–292.

Wolpert, D. H., and Tumer, K. 2000a. Avoiding Braess' paradox through collective intelligence. Available as tech. rep. NASA-ARC-IC-99-124 from http://ic.arc.nasa.gov/ic/projects/coin_pubs.html.

Wolpert, D. H., and Tumer, K. 2000b. An Introduction to Collective Intelligence. In Bradshaw, J. M., ed., Handbook of Agent Technology. AAAI Press/MIT Press. Available as tech. rep. NASA-ARC-IC-99-63 from http://ic.arc.nasa.gov/ic/projects/coin_pubs.html.

Wolpert, D. H.; Tumer, K.; and Frank, J. 1999. Using collective intelligence to route internet traffic. In Advances in Neural Information Processing Systems - 11, 952–958. MIT Press.

Wolpert, D. H.; Wheeler, K.; and Tumer, K. 2000. Collective intelligence for control of distributed dynamical systems. Europhysics Letters 49(6).