
Data mining neural networks with genetic algorithms

Ajit Narayanan, Edward Keedwell and Dragan Savic


School of Engineering and Computer Science
University of Exeter
Exeter EX4 4PT
United Kingdom
ajit@dcs.ex.ac.uk
tel: (+)1392 264064
Abstract

It is an open question as to what is the best way to extract symbolic rules from trained
neural networks in domains involving classification. Previous approaches based on an
exhaustive analysis of network connection and output values have already been
demonstrated to be intractable in that the scale-up factor increases exponentially with the
number of nodes and connections in the network. A novel approach using genetic
algorithms to search for symbolic rules in a trained neural network is demonstrated in
this paper. Preliminary experiments involving classification are reported here, with the
results indicating that our proposed approach is successful in extracting rules. While it is
accepted that further work is required to convincingly demonstrate the superiority of our
approach over others, there is nevertheless sufficient novelty in these results to justify
early dissemination. (If the paper is accepted, the latest results will be reported, together
with sufficient information to aid replicability and verification.)

Introduction

Artificial neural networks (ANNs) are increasingly used in problem domains
involving classification. They are adept at finding commonalities in a set of seemingly
unrelated data and for this reason are used in a growing number of classification tasks.
Unfortunately, a commonly perceived problem with ANNs when used for classification is
that, while a trained ANN can indeed classify the data, sometimes with more accuracy
than a traditional, symbolic machine learning approach, the reasons for their
classification cannot be found easily. Trained ANNs are commonly perceived to be
‘black boxes’ which map input data onto a class through a number of mathematically
weighted connections between layers of neurons. While the idea of ANNs as black boxes
may not be a problem in applications where there is little interest in the reasons behind
classification, this can be a major obstacle in applications where it is important to have
symbolic rules or other forms of knowledge structure, such as identification or decision
trees, which are easily interpretable by human experts. In particular, it may be important
to identify knowledge not previously known to domain experts and which may therefore
lie at the periphery of domain expertise. Also, safety-critical systems (such as air traffic
control or missile firing) which use neural networks successfully to classify data face
difficulty in being accepted because of the reluctance by managers and administrators to
accept a system which is not open to symbolic verification. Often, there is a legal
requirement that such safety-critical systems be demonstrated to be correct to a certain
degree of confidence. It is often claimed that neural networks, because of their plasticity
and use of soft constraints, can handle noisy data better than their symbolic counterparts
and should therefore be used precisely in those areas which are likely to benefit most
from their application, such as safety-critical systems and data mining.

In general, an ANN can be said to make its decisions by using the activation of the units
(input and hidden) combined with the weights of the connections between these units.
The topology of the network can also be used. Andrews et al. (1996) identify three types
of rule extraction techniques: ‘decompositional’, ‘pedagogical’ and ‘eclectic’, each of
which refers to a different method of extracting information from the network. A
decompositional approach is distinguished by its focus on extracting rules at the level of
individual (hidden and output) units. The computed output from each hidden and output
unit is mapped onto a binary ‘yes/no’ outcome corresponding to the notion of a rule
consequent. The major problem with this approach is the apparent exponential behaviour
of associated algorithms (Towell and Shavlik, 1993). Extracting rules from complex
ANNs may therefore be intractable. A pedagogical approach is distinguished by its
treatment of a trained ANN as a ‘black box’ where the knowledge to be extracted deals
directly with the way that input is mapped onto output by the internal weights (i.e. no
‘yes/no rules’ are extracted – just rules dealing with the changes in the levels of the input
and output units). The major problem with this approach is the sheer number of rules
generated for even the simplest domains. Finally, the eclectic approach is characterised
by any use of knowledge concerning the internal architecture and/or weight vectors in a
trained ANN to complement a symbolic learning algorithm. There is currently very little
understanding of available methods for constructing an eclectic approach, of the domains
where eclectic approaches may outperform their traditional symbolic and ANN
counterparts, and of how to evaluate the results of an eclectic approach.

In this paper we propose a novel, evolutionary eclectic approach which integrates
traditional ANNs with genetic algorithms for extracting simple, intelligible and useful
rules from trained ANNs. It is claimed that this approach adopts the advantages of ANNs
(gradual, incremental training which overcomes inconsistencies and ambiguities in the
data) as well as symbolic learning (intelligible output, rules for verification). In brief, the
paper proposes the use of a genetic algorithm to search the weight space of a trained
neural network to identify the best rules for classification. The genetic algorithm uses
chromosomes which can be mapped directly onto intelligible rules (phenotypes).

Two major constraints are the following. First, the goal of many rule-extraction
techniques is to find a comprehensive rule base for the network so that it can be encoded
as a set of ‘expert system’ rules in which the attributes causing a particular classification
can be precisely and fully determined. In this paper we propose that this is not necessary
in the majority of applications. Algorithms attempting to produce comprehensive rule sets
have a tendency to become exponential in complexity as network size increases. This
has been recognised by researchers, and in a recent paper (Arbatli and Akin, 1997) the
search space available to the symbolic algorithm has been decreased by optimizing the
topology of the network using genetic algorithms. The approach described here differs in
that it uses GAs to search a trained neural network for the extraction of symbolic rules
directly and not to optimise the network for another set of rule extraction techniques to
be applied. Secondly, the experiments below have been performed on categorical rather
than continuous data. Many datasets of significance in the real world do indeed have
continuous attributes, but datasets with large numbers of unpartitioned continuous
attributes are unlikely to be successfully classified by a neural network in any case.

The Genetic Algorithm/Neural Network System

The starting point of any rule-extraction system is to train the network on
the data required, i.e. the ANN is trained until a satisfactory error level is reached. For
classification problems, each input unit typically corresponds to a single feature in the
real world, and each output unit to a class value or class. The first objective of our
approach is to encode the network in such a way that a genetic algorithm can be run over
the top of it. This is achieved by creating an n-dimensional weight space where n is the
number of layers of weights. The network can be represented by simply enumerating
each of the nodes and/or connections. For example, Figure 1 depicts a simple neural
network with five input units (input features, data attributes), three hidden units, and one
output unit (class or class value), with each node enumerated in this case except the
output. Typically, there will be more than one output class or class value and therefore
more than one output node.

Figure 1 - A typical encoding of a simple neural network with only one class value
(one output node)

From this encoding, genes can be created which, in turn, are used to construct
chromosomes where there is at least one gene representing a node at the input layer and
at least one gene representing a node at the hidden layer. A typical chromosome for the
network depicted in Figure 1 could look something like this (Figure 2):

Figure 2 - A typical chromosome generated from the encoded network for only one
class value

This chromosome corresponds to the fifth unit in the input layer and the third unit in the
hidden layer. That is, the first gene contains the weight connecting input node 5 to hidden
unit 3, and the second gene contains the weight connecting hidden unit 3 to the output
class. Fitness is computed as a direct function of the weights which the chromosome
represents. For chromosomes containing just two genes (one for the input unit, the other
for the hidden unit), the fitness function is:

Fitness = Weight(Input→Hidden)*Weight(Hidden→Output)

where ‘→’ signifies the weight between the two enumerated nodes. So the fitness of the
chromosome in Figure 2 is:

Fitness = Weight(5→3)*Weight(3→Output)

This fitness is computed for an initial set of random chromosomes, and the population is
sorted according to fitness. An elitist strategy is then used whereby a subset of the top
chromosomes is selected for inclusion in the next generation. Crossover and mutation
are then performed on these chromosomes to create the rest of the next population.
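The fitness computation and elitist generational step described above can be sketched as follows. This is a minimal illustration in Python (the paper's own GA code is in C++); the random weight values, population size and mutation scheme are placeholder choices of ours, not taken from the paper:

```python
import random

random.seed(0)

# Hypothetical weights for the Figure 1 network (5 input units, 3 hidden
# units, 1 output unit); in practice these come from the trained ANN.
N_INPUT, N_HIDDEN = 5, 3
w_ih = [[random.uniform(-3, 3) for _ in range(N_HIDDEN)] for _ in range(N_INPUT)]
w_ho = [random.uniform(-3, 3) for _ in range(N_HIDDEN)]

def fitness(chromosome):
    """Fitness = Weight(Input->Hidden) * Weight(Hidden->Output)."""
    inp, hid = chromosome
    return w_ih[inp][hid] * w_ho[hid]

def random_chromosome():
    return (random.randrange(N_INPUT), random.randrange(N_HIDDEN))

def next_generation(population, elite=2):
    """Elitist step: the fittest chromosomes survive unchanged; the rest
    of the new population is built by mutating copies of the elite."""
    ranked = sorted(population, key=fitness, reverse=True)
    new_pop = ranked[:elite]
    while len(new_pop) < len(population):
        parent = random.choice(ranked[:elite])
        if random.random() < 0.5:          # mutate the input-layer gene
            child = (random.randrange(N_INPUT), parent[1])
        else:                              # mutate the hidden-layer gene
            child = (parent[0], random.randrange(N_HIDDEN))
        new_pop.append(child)
    return new_pop

pop = [random_chromosome() for _ in range(6)]
for _ in range(10):
    pop = next_generation(pop)
best = max(pop, key=fitness)
```

Because the elite chromosomes are carried over unchanged, the best fitness in the population can never decrease from one generation to the next.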

The chromosome is then easily converted into IF…THEN rules with an attached
weighting. This is achieved by using the template ‘IF <gene1> THEN output is
<class> (weighting)’, where the weighting is the fitness of the chromosome and the class
signifies which output unit is switched on. The weighting is a major part of the rule
generation procedure because its value is a direct measure of how the network
interprets the data. Since ‘Gene 1’ above corresponds to the weight between an input unit
and a hidden unit, the template is essentially stating that the consequent of the rule is
caused by the activation on that particular input node and its connection to a hidden unit
(not specified explicitly in the rule). The rule template above therefore allows the
extraction of single-condition rules. The number of extracted rules in each population can
be set by the user, according to the complexity of the network and/or the data. A larger
number of rules will yield less fit chromosomes and thus less important rules. This
property is essential in extracting rules which represent knowledge at the periphery of
expertise.
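Instantiating the rule template is a one-line operation; the sketch below reproduces the raw rule format used in the experiments that follow (e.g. ‘IF unit1 is 1 THEN output is 1 (fitness 4.667)’). The function name is our own:

```python
def chromosome_to_rule(input_unit, output_class, fitness):
    """Instantiate the template 'IF <gene1> THEN output is <class> (weighting)'."""
    return f"IF unit{input_unit} is 1 THEN output is {output_class} (fitness {fitness:.3f})"

rule = chromosome_to_rule(1, 1, 4.667)
# -> "IF unit1 is 1 THEN output is 1 (fitness 4.667)"
```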

Experimentation

Three experiments are described here. The first two experiments use a toy
example to show that our approach can find rules comparable to those found with purely
symbolic methods of data-mining. The third experiment was performed on a larger data
set to show that this method is generalisable to real-world domains. All GA programs are
written in C++. Neural network packages used were Neurodimensions’ Neurosolutions
v3.0 and Thinkspro v1.05 by Logical Designs Consulting.

Experiment 1

The dataset refers to named individuals for whom there are four attributes and
two possible class values (Figure 3 - adapted from Winston, 1992):

Name Hair Height Weight Lotion Result
Sarah Blonde Average Light No Sunburned
Dana Blonde Tall Average Yes Not sunburned
Alex Brown Short Average Yes Not sunburned
Annie Blonde Short Average No Sunburned
Emily Red Average Heavy No Sunburned
Pete Brown Tall Heavy No Not sunburned
John Brown Average Average No Not sunburned
Katie Blonde Short Light Yes Not sunburned
Figure 3 - The Sunburn Dataset

This dataset is converted as follows into a form suitable for input to the ANN (Figure 4):

Hair Blonde 100
Brown 010
Red 001
Height Short 100
Average 010
Tall 001
Weight Light 100
Average 010
Heavy 001
Lotion No 10
Yes 01
Class Sunburned 10
Not sunburned 01
Figure 4 - Neural Network Conversion of Data in Figure 3.
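This one-of-N conversion can be expressed directly as a lookup table. The sketch below is ours, but the encodings are exactly those of Figure 4:

```python
# One-of-N (one-hot) encoding of the categorical attributes in Figure 4.
ENCODING = {
    "Hair":   {"Blonde": "100", "Brown": "010", "Red": "001"},
    "Height": {"Short": "100", "Average": "010", "Tall": "001"},
    "Weight": {"Light": "100", "Average": "010", "Heavy": "001"},
    "Lotion": {"No": "10", "Yes": "01"},
}

def encode(hair, height, weight, lotion):
    """Concatenate the four attribute encodings into an 11-bit input string."""
    return (ENCODING["Hair"][hair] + ENCODING["Height"][height] +
            ENCODING["Weight"][weight] + ENCODING["Lotion"][lotion])

# Sarah: blonde hair, average height, light weight, no lotion used.
sarah = encode("Blonde", "Average", "Light", "No")
# -> "10001010010"
```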

One example of input is therefore 10001010010, which represents a blonde-haired (100),
average-height (010), light-weight (100), no-lotion-used (10) individual (i.e. Sarah). Note that we
are dealing with a supervised learning network, where the class in which the sample falls
is explicitly represented for training purposes. So, in the case of Sarah, the output 10
(sunburned) is used for supervised training. ‘10’ here signifies that the first output node is
switched on and the second is not. A neural network with 11 input, 5 hidden and 2
output units was created. The input to the network was a string of 0’s and 1’s which
corresponded to the records in the data set above. The network was then trained (using
back-propagation) until a mean square error of 0.001 was achieved. The network
weights were then recorded and the genetic algorithm process started. The weights
between the 11 input and 5 hidden units are as follows:

Hidden Unit 1 (all eleven input units):
-2.029721 1.632389 -1.702274 -1.369853 0.133539 0.296253 -0.465295 0.680639 -0.610233 -1.432447 -1.462687

Hidden Unit 2:
0.960469 1.304169 -0.558034 -0.870080 0.394558 0.537783 0.047991 0.575487 -1.571345 0.476647 -0.003466

Hidden Unit 3:
0.952550 -2.791922 1.133562 0.518217 1.647397 -1.801673 -1.518900 -0.245973 0.450328 -0.169588 -1.979129

Hidden Unit 4:
-1.720175 1.247111 1.095436 0.365523 0.350067 0.584151 0.773993 1.216627 -1.174810 -1.624518 2.342727

Hidden Unit 5:
-1.217552 2.288170 -1.088214 -0.389681 -0.919714 1.168223 0.579115 1.039906 1.499586 -2.902985 2.754642

The weights between the five hidden units and the two output units are as follows:
Output Unit 1 (all 5 hidden units):
-2.299536 -0.933331 2.137592 -2.556154 -4.569341

Output Unit 2:
2.235369 -0.597022 -3.967368 1.887921 3.682286

A random number generator was used to create the initial population of five
chromosomes for the detection of rules, where an extra gene is added to the end of the
chromosome to represent one of the two output class values. The alleles for this gene are
either 1 or 2, representing the output node values of 10 (sunburned) and 01 (not
sunburned).

The following decisions were taken:

1. The fittest chromosome of each generation goes through to the next generation.
2. The next chromosome is chosen at random, with greater fitness giving a greater
chance of being chosen (a ‘roulette wheel’ selection). Chromosomes with negative
fitness were excluded.
3. The remaining four chromosomes are created by mutating the two chromosomes
chosen above and by crossing over these same two. Duplicate chromosomes are removed.
4. Fitness was computed simply as Weight(input→hidden)*Weight(hidden→output).
The more positive the number, the greater the fitness.
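The roulette-wheel selection of decision 2 can be sketched as below. This is a standard fitness-proportionate implementation, not the paper's code, and the toy fitness values are purely illustrative:

```python
import random

random.seed(1)

def roulette_select(population, fitness_fn):
    """Fitness-proportionate ('roulette wheel') selection. Chromosomes
    with non-positive fitness are excluded, as in decision 2 above."""
    candidates = [c for c in population if fitness_fn(c) > 0]
    total = sum(fitness_fn(c) for c in candidates)
    pick = random.uniform(0, total)
    running = 0.0
    for c in candidates:
        running += fitness_fn(c)
        if running >= pick:
            return c
    return candidates[-1]  # guard against floating-point shortfall

# Toy population: chromosome label -> fitness (values are illustrative).
fitnesses = {"a": 4.0, "b": 1.0, "c": -2.0}
picks = [roulette_select(list(fitnesses), fitnesses.get) for _ in range(200)]
```

Chromosome ‘c’ can never be selected (negative fitness), and ‘a’ is chosen roughly four times as often as ‘b’.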

An example run (first three generations only) for extracting rules dealing with the first
output node only (i.e. for sunburn cases only) is given in Figure 5.

Results

A traditional symbolic learning algorithm running on this dataset will find the following
four rules: (a) If person has red hair then person is sunburned; (b) If person is brown
haired then person is not sunburned; (c) If person has blonde hair and no lotion used then
person is sunburned; and (d) If person has blonde hair and lotion used then person is not
sunburned. Our approach identified the following five single condition rules in ten
generations, with a maximum population of 6 in each generation:

(i) ‘IF unit1 is 1 THEN output is 1 (fitness 4.667)’, which corresponds to: ‘IF hair
colour=blonde THEN result is sunburned’. The fitness here is calculated as follows:
input unit 1 to hidden unit 1 weight of
-2.029721∗ hidden unit 1 to output unit 1 weight of -2.299536.

Figure 5 – First three generations of chromosome evolution in the extraction of rules
dealing with sunburn cases (output node 1) only

(ii) ‘IF unit 3 is 1 THEN output is 1 (fitness 3.908)’, which corresponds to ‘IF hair
colour=red THEN result is sunburned’ (input unit 3 to hidden unit 1 weight of -1.702274
∗ hidden unit 1 to output unit 1 weight of -2.299536).

(iii) ‘IF unit 10 is 1 THEN output is 1 (fitness 4.154)’, which corresponds to ‘IF no lotion
used THEN result is sunburned’ (input unit 10 to hidden unit 4 weight of -1.624518 ∗
hidden unit 4 to output weight of -2.556154).

(iv) ‘IF unit 2 is 1 THEN output is 2 (fitness 8.43)’, which corresponds to: ‘IF hair
colour=brown THEN result is not sunburned’ (input unit 2 to hidden unit 5 weighting of
2.288170 ∗ hidden unit 5 to output unit 2 weighting of 3.682286, with rounding).

(v) ‘IF unit 11 is 1 THEN output is 2 (fitness 10.12)’, which corresponds to ‘IF lotion
used THEN result is not sunburned’ (input unit 11 to hidden unit 5 weighting of
2.754642 ∗ hidden unit 5 to output unit 2 weighting of 3.682286, with rounding).

Figure 5 shows that, for the sunburnt cases (rules (i) – (iii) above), there is early
convergence (within three generations) to these rules. The fitness values cited in the rule
set above may not be the maximum attainable but are nevertheless significantly above 0.
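As a check, the quoted fitness values can be recomputed directly from the weight listings above; the small residual discrepancies come from rounding in the text. A minimal sketch:

```python
# Rule (i)-(v) fitnesses recomputed from the weight listings above:
# (input->hidden weight, hidden->output weight, fitness quoted in the text).
rules = [
    (-2.029721, -2.299536, 4.667),   # (i)   blonde hair -> sunburned
    (-1.702274, -2.299536, 3.908),   # (ii)  red hair -> sunburned
    (-1.624518, -2.556154, 4.154),   # (iii) no lotion -> sunburned
    ( 2.288170,  3.682286, 8.43),    # (iv)  brown hair -> not sunburned
    ( 2.754642,  3.682286, 10.12),   # (v)   lotion used -> not sunburned
]
recomputed = [wi * wo for wi, wo, _ in rules]
```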

Experiment 2

Another toy example was chosen from the machine learning literature, again
with only 8 records and four attributes (Figure 6).

Dataset

Run Supervisor Overtime Operator Output
1 Sally Yes Joe High
2 John No Samantha High
3 Sally Yes Joe High
4 John No Joe Low
5 Sally Yes Samantha High
6 Patrick No Samantha Low
7 Sally Yes Joe High
8 Patrick No Samantha Low

Figure 6: Second experimental dataset

The conversion between data and neural network representation was performed as before
(Figure 7).

Supervisor Sally 100
John 010
Patrick 001
Overtime Yes 10
No 01
Operator Joe 10
Samantha 01
Output High 10
Low 01

Figure 7: Conversion of second dataset into a neural network format

The rules involved in this classification are complex and there is some repetition so that
only very few records actually make a contribution to a rule. Symbolic algorithms do not
produce good results over this data set. See5 creates the ruleset:

IF overtime = Yes THEN output = High [0.833]
IF overtime = No THEN output = Low [0.667]

CN2 creates these single-condition rules, along with some dual condition rules:

IF supervisor = Sally THEN output = High [0 4]
IF supervisor = Patrick THEN output = Low [2 0]

where the numbers in brackets signify how many cases of each class are captured by
that rule. For instance, ‘[0 4]’ after the first rule above signifies that this rule captures
none of the low output cases and 4 of the high output cases. The ANN with 7 input, 4
hidden and 2 output units was trained over a series of 1522 epochs to achieve a mean
squared error of 0.040. Below is the weight space for the network.

Hidden Unit 1 (all seven input to hidden connections)
-0.836101 -0.437469 -0.972496 -0.977659 0.265379 -0.459824 0.313158
Hidden Unit 2
-2.508566 -2.855611 1.858439 -1.711295 2.86410 2.675891 -1.834709
Hidden Unit 3
1.726850 0.421753 -0.725803 1.372710 -1.471043 0.338697 0.652326
Hidden Unit 4
-1.738682 -1.385388 2.255858 -0.626335 2.316902 0.007883 -3.285211

Output Unit 1 (all four hidden to output connections)
0.491153 -4.961958 2.423375 -2.589325
Output Unit 2
-0.687410 4.479441 -2.092269 3.477822

The genetic algorithm was started with a population of 10 and run for just 20 generations.
The top rules for each classification were as follows:

IF Supervisor = John THEN output = High (12.948)
IF Supervisor = Sally THEN output = High (10.966)
IF Operator = Samantha THEN output = High (7.847)

IF Overtime = No THEN output = Low (11.498)
IF Operator = Joe THEN output = Low (10.706)
IF Supervisor = Patrick THEN output = Low (7.120)

As before, the fitness measures for each rule are quoted to allow decisions to be made as
to the validity of each of the rules. As can be seen from the ruleset, the results from the
symbolic algorithms have largely been reproduced and the algorithm has also found some
extra rules.

Experiment 3

The dataset used was the mushroom dataset - a well-known collection of data
used for classifying mushrooms into an edible or poisonous class. The data contains 125
categories spanning 23 attributes.
As before, the data was converted into a neural network input format. The network was
first trained on this full dataset for 41 epochs, reaching an error of 0.0161. However, the test
results from these runs were very poor, which prompted an investigation of the network
weights, revealing that the network was not learning successfully. Several solutions to
this problem were hypothesised and implemented with little success. The problem turned
out to be that the data set has a large number of unused categories and these were
translated along with the rest of the data, resulting in a network with a very sparse
distribution of information since over half of the categories were not present. These
categories were eliminated from the data and a smaller network with 30 hidden units was
trained on the smaller 62 category data set for 69 epochs. The error was higher than
before at 0.03, but testing was, on average, better. The genetic algorithm was run for 100
iterations with a population of 20. There were 7 operations per population: 4 crossover
and 3 mutation. The mutation rate was randomly set between -40 and +40. The rules
found by the GAs were encouragingly similar to those found by traditional algorithms,
but the system also supplemented the most obvious rules with some previously
undiscovered ones, exclusive to our approach:
IF odour=p THEN poisonous (max 2.23) (found by CN2 and See5)
IF gill-size=n THEN poisonous (max 1.13) (exclusive)
IF stalk-root=e THEN poisonous (max 1.13) (exclusive)
IF gill-size=b THEN edible (max 2.3) (found by CN2)
IF odour=n THEN edible (max 1.58) (exclusive)
IF cap-surface=f THEN edible (max 1.58) (found by CN2)

The weightings quoted are maximum values, since the same rules surface frequently in the rule
list with different fitness values, depending on which hidden unit the input was connected
to. The rules correlate well with the ones found by traditional packages. In fact, they are
almost identical to the rules found by CN2. The exciting aspect here is that there are
some totally new rules extracted regarding each classification. The algorithms used in
traditional classification programs found only the odour=p rule for poisonous
classification, whereas our approach found two other rules.
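Taking the maximum weighting over hidden-unit paths, as described above, can be sketched as follows. The weight values here are made up purely for illustration:

```python
# One input->output rule has a candidate fitness for each hidden-unit path;
# the quoted weighting is the maximum over those paths.
w_input_to_hidden = [-0.8, 2.1, 0.4, -1.5]   # one input unit to four hidden units
w_hidden_to_output = [0.5, 1.2, -2.0, -1.1]  # four hidden units to one output unit

candidates = [wi * wo for wi, wo in zip(w_input_to_hidden, w_hidden_to_output)]
max_fitness = max(candidates)  # the weighting reported for the rule
```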

The need to adapt the neural network to deal with a subset of the original data highlights
an inherent problem in any approach which attempts to integrate neural network learning
with symbolic rule extraction: the genetic algorithm can only generate rules from the
neural network if they already exist. If a network has not been trained properly on the
data set then the algorithm will not find the required associations. This means that users
must be very sure that the trained network is an accurate model of the domain they are
trying to mine. If this is not the case then the system will find spurious rules.

Discussion

Work is currently underway to amend the chromosome representation to extract two-condition
and multi-condition rules from the neural network trained on the mushroom
dataset, as well as to improve the behaviour of the trained neural network even further
when tested with examples not previously seen. It is an open question as to how well the
trained neural network has to perform on unseen examples before the process of rule
extraction can begin.

Together, the preliminary results reported here provide evidence of the feasibility of
integrating GAs with trained neural networks, both technically and in terms of efficiency.
The approach can be scaled up easily, with the major constraint on scale being the
accuracy of the trained neural network when dealing with large datasets. What was
particularly interesting was the extraction of rules not captured by traditional symbolic
learning techniques. While such rules may not be totally accurate in that they don’t
capture all or even most of the samples in a dataset, there is no doubt that the approach
outlined here can perform the useful function of extracting rules which lie at the
periphery of domain expertise or which capture exceptions (which can then be further
analysed to identify reasons for being exceptions). One of the major advantages of this
approach is that this is precisely what may be required in commercial applications of data
mining, where the task is not to mine the data to extract rules which are already known to
domain experts but to capture significant exceptions to general rules which then need
explaining in their own right for commercial advantage. The extraction of rules from the
neural network trained on the mushroom data set, where these rules were not captured by
symbolic data mining techniques, is therefore particularly significant, since it suggests
that the ability of neural networks to classify samples which cannot be classified by
symbolic means can now be tapped to produce intelligible rules which lie at the periphery
of domain expertise. In short, we claim that our approach utilises the best aspects of
neural network learning in noisy domains with the best aspects of symbolic rules through
the application of GAs.

There are a number of outstanding issues, all currently being worked on. (If this paper is
accepted for the Conference, the latest results using our approach will be described.) Our
system essentially finds a collection of paths (rules) through the trained network to
determine the optimal ones for a particular classification. It is certainly possible that one
input unit can exert both a negative and a positive influence over the same classification.
When fired, this unit could contribute in a large way towards the classification through
one hidden unit, but it might also have another set of heavily negative connections to
other hidden units which would negate that classification. In that case, the genetic
algorithm will find the large positive and negative connections and interpret their effect
separately, thereby creating erroneous and perhaps contradictory rules. In fact, for the
experiments listed above, there was a symmetry about the weights which reflected how
an input was classified. If the network determines that a certain attribute is not
contributing to a classification, it is far more likely to reduce the effect that that unit has
on the network rather than increase two sets of weights. This is largely how back-
propagation works, but it shows up a possible weakness in our approach if used on
networks which have been trained using a different learning algorithm from
backpropagation. Further experiments are required on ANNs of different types (e.g.
competitive, non-supervised learning networks) and different architectures (e.g. of more
than one hidden layer of neurons). The indications are that the system should be even
better suited to ANNs with larger numbers of hidden layers because, whilst the
complexity involved in extracting rules increases enormously, the complexity of the
genetic algorithm does not.

Bibliography

1. Andrews, R., Cable, R., Diederich, J., Geva, S., Golea, M., Hayward, R., Ho-Stuart, C.
and Tickle, A.B. (1996). An Evaluation And Comparison of Techniques For Extracting
and Refining Rules From Artificial Neural Networks. World Wide Web URL:
http://www.fit.qut.edu.au/NRC/ftpsite/QUTNRC-96-01-04.html

2. Arbatli, A.D. and Akin, H.L. (1997). Rule Extraction from Trained Neural Networks
Using Genetic Algorithms. Nonlinear Analysis, Theory, Methods and Applications, Vol.
30, No. 3, pp. 1639-1648.

3. Towell, G. and Shavlik, J. (1993). Extracting refined rules from knowledge-based
neural networks. Machine Learning, 13(1), pp. 71-101.

4. Winston, P. H. (1992). Artificial Intelligence (3rd Edition). Addison Wesley.



Acknowledgement

The research contained in this paper was funded in part by a grant from the Royal Mail.

