A COMPARISON OF Q-LEARNING AND CLASSIFIER SYSTEMS
Marco Dorigo+,* and Hugues Bersini*

* IRIDIA, Université Libre de Bruxelles
Avenue Franklin Roosevelt 50, CP 194/6, 1050 Bruxelles, Belgium
{bersini, mdorigo}@ulb.ac.be

+ Progetto di Intelligenza Artificiale e Robotica
Dipartimento di Elettronica e Informazione, Politecnico di Milano
Piazza Leonardo da Vinci 32, 20133 Milano, Italy
dorigo@elet.polimi.it
[Figure: a) a classifier set with conflict resolution, mapping sensing to action; b) an architecture of learning modules (MLe, MLi, MLa), each with its own sensing, mapping sensing to action.]
In CSs generalization has been addressed by the use of the don't care (#) operator. A classifier attribute occupied by a don't care is considered to be irrelevant, and it is therefore by means of don't cares that generalization over classes of equivalent state-action pairs is achieved.
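As a minimal illustration of this mechanism (our own sketch, not code from any of the systems discussed), the following Python fragment matches a binary input message against a classifier condition in which # stands for either bit value; the condition and messages are hypothetical.

    def matches(condition: str, message: str) -> bool:
        """True if every condition position equals the message bit or is '#'."""
        return len(condition) == len(message) and \
            all(c == m or c == '#' for c, m in zip(condition, message))

    # One condition with don't cares covers a whole class of equivalent states:
    # '1##0' matches 1000, 1010, 1100 and 1110.
    assert matches('1##0', '1010')
    assert not matches('1##0', '1011')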
However, it is still an open problem how to develop sets of rules which can effectively exploit this generalization possibility (the so-called default hierarchies, in which general rules cover default situations and specialized rules are activated by exceptions; see Riolo, 1987; 1989). Moreover, CSs are inherently discrete systems (although fuzzy CSs have been proposed to deal with continuous-valued inputs and outputs; see Valenzuela-Rendón, 1991; Bonarini, 1993), which makes it difficult to achieve the kind of generalization in which a similar input should cause a similar output. This is more easily achieved in Q-learning systems enriched by neural nets, like Lin's system (1992).
Observing reinforcement learning algorithms, we are in the presence of the two traditional and opposite ways to obtain partitions of the input space: bottom-up and top-down (Mitchell, 1982). In the bottom-up approach, the input space is initially very fine grained, and the partitions are created by clustering together input regions showing similar properties. In the top-down approach, the initial input space is uniformly coded and is progressively subdivided into finer parts. The bottom-up type of generalization has been investigated by Q-learning users following two approaches. The first one is the already cited approach of Lin (1992), who uses a neural net for each action. In the second, proposed by Mahadevan and Connell (1992), state-action points are clustered using statistical clustering methods. Since top-down methods amount to a progressive structural modification of the problem space, they will be discussed in the next section.
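As a hedged sketch of the bottom-up idea (ours, not the actual statistical clustering of Mahadevan and Connell, 1992), the fragment below greedily merges discrete states whose Q-value profiles over the actions are close, so that each resulting cluster can be treated as a single partition cell; the linkage choice, threshold, and data are illustrative.

    import numpy as np

    def cluster_states(q_table: np.ndarray, threshold: float) -> list[list[int]]:
        """Greedily group states whose rows of Q-values (one value per
        action) differ by less than `threshold`."""
        clusters: list[list[int]] = []
        for s in range(q_table.shape[0]):
            for cluster in clusters:
                # Compare against the cluster's first member (a simple linkage choice).
                if np.max(np.abs(q_table[s] - q_table[cluster[0]])) < threshold:
                    cluster.append(s)
                    break
            else:
                clusters.append([s])
        return clusters

    # Four states, two actions: states 0 and 2 behave alike and are merged.
    q = np.array([[1.0, 0.1], [0.0, 0.9], [1.05, 0.15], [0.1, 1.0]])
    print(cluster_states(q, threshold=0.2))  # [[0, 2], [1, 3]]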
5. The double plasticity

In this section we discuss the last constraint we introduced to get VSCS, that is, constraint (iv) of Section 1, regarding the coverage of state-action space. Q-learning and CS are both concerned with populations of agents which collectively achieve a certain objective: they are adaptive and distributed types of control. Holland's biological inspiration, that is, ecosystems and genetic mechanisms, led him to address structural plasticity by the GA. However, an ecosystem does not perform any kind of distributed control; instead, each individual selfishly tries to survive in an environment constituted by the other individuals. A perhaps more adequate biological inspiration could have been the immune system, in which new cells are produced at a very high rate (Bersini, 1993; Bersini and Varela, 1994) to improve the capacity of the system to achieve, as a whole and in an adaptive way, its vital functions. We hypothesize that, when one is interested in the collective performance of a system, its structural adjustments should aim at compensating for the current weakest parts of the system. This is in contrast with Holland's GA approach, in which the generation of new actors is biased towards the best ones.
In CSs, the GA is responsible for two types of structural changes, depending on whether it applies to the condition or to the consequent part of the classifiers: a change in the coding of the state space and a change in the set of actions. These two types of changes have also been investigated in the Q-learning community, but within the "collective and compensatory" perspective just presented. Aware of the exponential dependency of the search duration on the size of the state-action space, researchers have proposed different methods to progressively divide the space, so as to obtain a final solution in the smallest possible space.
Chapman and Kaelbling's G-algorithm (1991) recursively splits the state space using statistical measures of differences in the reinforcements received. Munos, Patinel and Goyeau (1994) make a similar recursive splitting of the state space following a different criterion. They split a state-action pair if the action acting in that particular state presents variations of great amplitude, indicating that the reinforcement received by that action is highly fluctuating and thus requires a subdivision of the state (a similar approach was taken by Dorigo, 1993, in his CS called ALECSYS). This is typical of a mechanism which, contrary to the GA, focuses on the current unsatisfactory part of the system to guide its improvement by a structural addition.
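The fragment below is a hedged sketch of this compensatory, top-down mechanism (ours; it implements neither the G-algorithm nor the exact criterion of Munos et al., 1994). A state cell whose reinforcement samples fluctuate too much is split in two, while stable cells are left alone; the cell names and the threshold are illustrative.

    import statistics

    def split_fluctuating_cells(rewards_per_cell: dict[str, list[float]],
                                max_std: float) -> list[str]:
        """Return a refined list of state cells: each cell whose reinforcement
        samples have a standard deviation above `max_std` is split in two."""
        new_cells: list[str] = []
        for cell, rewards in rewards_per_cell.items():
            if len(rewards) > 1 and statistics.stdev(rewards) > max_std:
                # Highly fluctuating reinforcement: refine this part of the
                # space by splitting the cell on one more input bit.
                new_cells.extend([cell + '0', cell + '1'])
            else:
                new_cells.append(cell)
        return new_cells

    # Cell '1' receives wildly varying rewards and is subdivided into '10'
    # and '11'; cell '0' is stable and kept as it is.
    samples = {'0': [0.9, 1.0, 0.95], '1': [-1.0, 1.0, -0.8, 0.9]}
    print(split_fluctuating_cells(samples, max_std=0.5))  # ['0', '10', '11']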
Regarding the action part, Bersini (1993) has implemented an adaptive control strategy in which the adaptation relies on parametric and structural types of change occurring at different time scales (the inspiration is the immune system's two-level adaptive strategy). First, Q-learning tries to find, for each state, the best actions among a preliminary set of actions (the same in each state) during a certain number of Q-learning iterations. Then, after a certain number of steps, new actions are recruited in the states in which the current actions show bad Q-values as compared to other actions acting in the same state or in neighboring states. The new actions are selected so as to be the opposite of the bad ones they replace (for instance, move left instead of move right, or a negative value instead of a positive one), and they are given the Q-value of the best action already present in the state (so that the new actions are immediately tested). Q-learning is then activated again on the basis of the new set of actions.
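A hedged sketch of this recruitment step might look as follows (our reading of the description above, not Bersini's code; for simplicity, actions are scalar control values, the opposite of an action is its negated value, the lag threshold is illustrative, and the comparison with neighboring states is omitted).

    def recruit_actions(q: dict[str, dict[float, float]], gap: float) -> None:
        """In each state, replace every action whose Q-value lags the state's
        best by more than `gap` with its opposite, seeded at the best Q-value
        so that the recruited action is immediately tested."""
        for state, action_values in q.items():
            best = max(action_values.values())
            for action, value in list(action_values.items()):
                if best - value > gap:
                    del action_values[action]
                    action_values[-action] = best  # opposite action, best Q-value

    q = {'s0': {+1.0: 0.9, +0.5: -0.7}, 's1': {+1.0: 0.4, -1.0: 0.3}}
    recruit_actions(q, gap=1.0)
    print(q)  # {'s0': {1.0: 0.9, -0.5: 0.9}, 's1': {1.0: 0.4, -1.0: 0.3}}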
6. Conclusions

We have seen that DMAX-VSCS and Q-learning share
• the class of problems to which they are applied,
• the apportionment of credit mechanism.

We have also seen that both models are the subject of research directed at enriching them with more powerful capabilities, to make them useful for real-world applications. This research brings DMAX-VSCS back to its original form, namely Holland's CS; Q-learning, on the other hand, is being enriched by functionally similar mechanisms. We have seen that the CS model is an inherently discrete model (but see Valenzuela-Rendón (1991) and Bonarini (1993) for fuzzy implementations of the CS), while Q-learning can smoothly go from discrete implementations (e.g., Watkins's tabular Q-learning, or more efficient kinds of discrete partition of the state space, like Chapman and Kaelbling (1991) or Munos et al. (1994)) to continuous neural net implementations. Regarding generalization capabilities, the two models have followed rather different directions, with CSs using don't care symbols and Q-learning a whole set of different techniques. Both have tackled the short-term memory problem with functionally similar mechanisms: internal messages in CSs, recurrent nets in Q-learning.
Both have also proposed the idea that, to build working systems, it could be useful to adopt a divide-and-conquer approach, in which the global task is decomposed into many simpler tasks which are then used as modules of a designed architecture. This can be found in the work of Dorigo and Schnepf (1993), Dorigo (1992), Dorigo and Colombetti (1994), and Colombetti and Dorigo (1992) for CSs, and in the work of Lin (1993), Dayan and Hinton (1992), and Mahadevan and Connell (1992) for Q-learning. In all these works the proposed architecture is hierarchic: simple tasks are learned by reinforcement learning and their coordination is designed (in Dorigo's and Lin's work coordination is also learned).
Finally, in Table 1 we summarize the characteristics of the two models as discussed in the paper.