
Entropy Nets: From Decision Trees to Neural Networks

A multiple-layer artificial neural network (ANN) structure is capable of implementing arbitrary input-output mappings. Similarly, hierarchical classifiers, more commonly known as decision trees, possess the capability of generating arbitrarily complex decision boundaries in an n-dimensional space. Given a decision tree, it is possible to restructure it as a multilayered neural network. The objective of this paper is to show how this mapping of decision trees into a multilayer neural network structure can be exploited for the systematic design of a class of layered neural networks, called entropy nets, that have far fewer connections. Several important issues, such as automatic tree generation, incorporation of incremental learning, and the generalization of knowledge acquired during the tree design phase, are discussed. Finally, a two-step methodology for designing entropy networks is presented. The advantages of this methodology are that it specifies the number of neurons needed in each layer, along with their desired outputs. This leads to a faster progressive training procedure that allows each layer to be trained separately. Two examples are presented to show the success of neural network design through decision tree mapping.

Manuscript received July 28, 1989; revised March 13, 1990. The author is with the Department of Computer Science, Wayne State University, Detroit, MI 48202. IEEE Log Number 9039179.

I. INTRODUCTION

Artificial neural networks offer an exciting computational paradigm for cognitive machines. The main attribute of the ANN paradigm is the distributed representation of knowledge in the form of connections between a very large number of simple computing elements, called neurons. These neurons are arranged in several distinct layers. The interfacing layer on the input side of the network is called the sensory layer; the one on the output side is the output layer or the motor control layer. All the intermediate layers are called hidden layers. All the computing elements may perform the same type of input-output operation, or different layers of computing elements may realize different kinds of input-output transfer functions. The reason for all the excitement about ANNs lies in their capability to generalize an input-output mapping from a limited set of training examples.

One important application area for ANNs is pattern recognition. A pattern, in general, could be a segment of time-sampled speech, a moving target, or the profile of a prospective graduate student seeking admission. Pattern recognition implies initiating certain actions based on the observation of input data. The input data representing a pattern are called the measurement or feature vector. The function performed by a pattern recognition system is the mapping of the input feature vector into one of the various decision classes. The mapping performed by a pattern recognition system can be represented in many cases by writing the equations of the decision boundaries in the feature space. A linear input-output mapping realized by a particular pattern recognition system then implies that the decision boundaries have a linear form. However, most pattern recognition problems of practical interest need a nonlinear mapping between the input and output. Although a single neuron is capable of only a linear mapping, a layered network of neurons with multiple hidden layers provides any desired mapping. It is this capability of layered networks that has resulted in the renewed interest in the ANN field.

An example of a multiple hidden layer network is shown in Fig. 1(a). Generally, all neurons in a layer are connected to all the neurons in the adjacent layers. The connection strength between two neurons from adjacent layers is represented in the form of a weight value. The significance of this weight value is that it acts as a signal multiplier on the corresponding connection link. Each neuron in the layered network is typically modeled as shown in Fig. 1(b). As indicated in the figure, the input to a neuron is the linear summation of all the incoming signals on the various connection links. This net summation is compared to a threshold value, often called the bias. The difference arising due to the comparison drives an output function, usually called an activation function, to produce a signal at the output line of the neuron. The two most common choices for the activation function are the sigmoid and hyperbolic tangent functions.

Fig. 1. (a) An example of a multiple hidden layer neural network. (b) A typical neuron model. The triangular shape represents the summation; the inner dotted box represents the sigmoid activation function.


In the context of pattern recognition, such layered networks are also called multilayer perceptron (MLP) networks. It can be easily shown that two hidden layers are sufficient to form piecewise linear decision boundaries of any complexity [1], [2]. However, it must be noted that two layers are not necessary for arbitrary decision regions [3]. The first hidden layer is the partitioning layer that divides the entire feature space into several regions. The second hidden layer is the ANDing layer that performs ANDing of the partitioned regions to yield convex decision regions for each class. The output layer can be considered as the ORing layer that logically combines the results of the previous layer to produce disjoint decision regions of arbitrary shape, with holes and concavities if needed.
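To make the roles of the three layers concrete, the following is a minimal sketch in Python (with NumPy) of a forward pass through such a partitioning/ANDing/ORing structure; the layer sizes, weights, thresholds, and the hard-threshold activation used here are illustrative assumptions rather than values taken from the paper.

    import numpy as np

    def step(v):
        # Hard-threshold (relay) activation: output 1 when the net input exceeds the bias.
        return (v > 0).astype(float)

    def forward(x, W1, b1, W2, b2, W3, b3):
        h1 = step(W1 @ x - b1)     # partitioning layer: each neuron realizes one hyperplane
        h2 = step(W2 @ h1 - b2)    # ANDing layer: intersects half spaces into convex regions
        return step(W3 @ h2 - b3)  # ORing layer: unites the regions belonging to each class

    # Toy network: two features, three partitioning neurons, two AND neurons, one output class.
    W1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]); b1 = np.array([0.5, 0.5, 1.5])
    W2 = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]);   b2 = np.array([1.5, 1.5])
    W3 = np.array([[1.0, 1.0]]);                         b3 = np.array([0.5])
    print(forward(np.array([0.8, 0.7]), W1, b1, W2, b2, W3, b3))   # -> [1.]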
The most common training paradigm for layered networks is supervised learning. In this mode of learning, the network is presented with examples of input-output mapping pairs. During the learning process, the network continuously modifies its connection strengths or weights to achieve the mapping present in the examples. While single-layer neuron training procedures have been around for 30-40 years, the extension of these training procedures to multilayer neuron networks proved to be a difficult task because of the so-called credit assignment problem, i.e., what should be the desired output of the neurons in the hidden layers during the training phase?

One of the solutions to the credit assignment problem that has gained prominence is to propagate back the error at the output layer to the internal layers. The resulting backpropagation algorithm [4] is the most frequently used training procedure for layered networks. It is a gradient descent procedure that minimizes the error at the output layer. Although the convergence of the algorithm has been proved only under the assumption of infinitesimally small weight changes, practical implementations with larger weight changes appear to yield convergence most of the time. Because of the use of the gradient search procedure, the backpropagation algorithm occasionally leads to solutions that represent local minima. Recently, many variations of the backpropagation algorithm have been proposed to speed up the network learning time. Some other examples of layered network training procedures are Boltzmann learning [5], counterpropagation [6], and Madaline Rule II [7]. These training procedures, including the backpropagation algorithm, however, are generally slow. Additionally, these training procedures do not specify in any way the number of neurons needed in the hidden layers. This number is an important parameter that can significantly affect the learning rate as well as the overall classification performance, as indicated by the experimental studies of several researchers [1], [8].

There exists a class of conventional pattern classifiers that has many similarities with the layered networks. This class of classifiers is called hierarchical or decision tree classifiers. As the name implies, these classifiers arrive at a decision through a hierarchy of stages. Unlike many conventional pattern recognition techniques, the decision tree classifiers also do not impose any restriction on the underlying distribution of the input data. These classifiers are capable of producing arbitrarily complex decision boundaries that can be learned from a set of training vectors. While the tree-based learning is noniterative or single step, the neural net learning is incremental. The incremental learning mode is more akin to human learning, where the hypotheses are continually refined in response to more and more training examples. However, the advantage of single-step learning is that all the training examples are considered simultaneously to form hypotheses, thus leading to faster learning. The two significant differences between the decision tree classifiers and the layered networks are: 1) the sequential nature of the tree classifiers as opposed to the massive parallelism of the neural networks, and 2) the limited generalization capability of the tree classifiers as a learning mechanism in comparison to layered networks. The aim of this paper is to show how the similarities between the tree classifiers and the layered networks can be used for developing a pragmatic approach to the design and training of neural networks for classification. The motivation for this work is to provide a systematic layered network design methodology that has built-in solutions to the credit assignment problem and the choice of network topology.

The organization of the rest of the paper is as follows. In Section II, I introduce decision trees and their mapping into the form of layered networks. A recursive tree design procedure is introduced in Section III to acquire knowledge from the input data. Section IV discusses the issues related to incremental learning and the generalization of the captured knowledge. This is followed by a two-step layered network design procedure that allows the training of all the layers by progressively propagating the acquired knowledge. In Section V, I present experimental results that demonstrate the speed of learning, as well as the extent of generalization possible through the present design approach. Section VI contains the summary of the paper.

II. DECISION TREE CLASSIFIERS

Decision trees offer a structured way of decision making in pattern recognition. The rationale for decision tree-based partitioning of the decision space has been well summarized by Kanal [9]. A decision tree is characterized by an ordered set of nodes. Each of the internal nodes is associated with a decision function of one or more features. The terminal nodes or leaf nodes of the decision tree are associated with actions or decisions that the system is expected to make. In an m-ary decision tree, there are m descendants for every node. The binary decision tree is the most commonly used form, and an equivalent binary tree exists for any m-ary decision tree. Henceforth in this paper, a decision tree will imply a binary tree.

A decision tree induces a hierarchical partitioning over the decision space. Starting with the root node, each of the successive internal nodes partitions its associated decision region into two half spaces, with the node decision function defining the dividing hyperplane. An example of a decision tree and the corresponding hierarchical partitioning induced by the tree are shown in Fig. 2. It is easy to see that as the depth of the tree increases, the resulting partitioning becomes more and more complex.

Fig. 2. (a) An example of a decision tree. Square boxes represent terminal nodes. (b) Hierarchical partitioning of the two-dimensional space induced by the decision tree of (a).
Classification using a decision tree is performed by traversing the tree from the root node to one of the leaf nodes using the unknown pattern vector. The response elicited by the unknown pattern is the class or decision label attached to the leaf node that is reached by the unknown vector. It is obvious that all the conditions along any particular path from the root to the leaf node of the decision tree must be satisfied in order to reach that particular leaf node.
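To make this traversal concrete, here is a small Python sketch of a binary decision tree whose internal nodes test a single feature against a threshold; the particular node layout and class labels are hypothetical and used only for illustration.

    class Node:
        """Internal node: tests feature[idx] < threshold. Leaf node: carries a class label."""
        def __init__(self, idx=None, threshold=None, left=None, right=None, label=None):
            self.idx, self.threshold = idx, threshold
            self.left, self.right, self.label = left, right, label

    def classify(node, x):
        # Follow a single root-to-leaf path; every test on the path must hold to reach that leaf.
        while node.label is None:
            node = node.left if x[node.idx] < node.threshold else node.right
        return node.label

    # Hypothetical two-feature tree with three leaves.
    tree = Node(0, 0.5,
                left=Node(label="class A"),
                right=Node(1, 0.5, left=Node(label="class B"), right=Node(label="class A")))
    print(classify(tree, [0.7, 0.2]))   # -> class B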

Thus, each path of a decision tree implements an AND operation on a set of half spaces. If two or more leaf nodes result in the same action or decision, then the corresponding paths are in an OR relationship. Since a layered neural network for classification also implements ANDing of hyperplanes followed by ORing in the output layer, it is obvious that a decision tree and a layered network are equivalent in terms of input-output mapping. Not only that, a decision tree can be restructured as a layered network by following certain rules. These rules can be informally stated as follows.

• The number of neurons in the first layer of the layered network equals the number of internal nodes of the decision tree. Each of these neurons implements one of the decision functions of the internal nodes. This layer is the partitioning layer.
• All leaf nodes have a corresponding neuron in the second hidden layer, where the ANDing is implemented. This layer is the ANDing layer.
• The number of neurons in the output layer equals the number of distinct classes or actions. This layer implements the ORing of those tree paths that lead to the same action.
• The connections between the neurons from the partitioning layer and the neurons from the ANDing layer implement the hierarchy of the tree.

An example of tree restructuring following the above rules is shown in Fig. 3 for the decision tree of Fig. 2. As this example shows, it is fairly straightforward to map a decision tree into a layered network of neurons. It should be noted that the mapping rules given above do not attempt to optimize the number of neurons in the partitioning layer. However, a better mapping can be achieved by incorporating checks in the mapping rules for replications of the node decision functions in different parts of the tree, so as to avoid duplicating neurons in the partitioning layer. The mapping can be further improved by using algorithms [10] that produce an optimal tree from the partitioning specified by a given decision tree.

Fig. 3. Three-layer mapped network for the decision tree of Fig. 2(a).
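The rules above amount to a mechanical construction. The sketch below, in Python, derives the three layer sizes and the sparse connectivity from a binary tree; the nested-tuple tree representation (a leaf is a class-label string, an internal node is a (feature, threshold, left, right) tuple) and the +/-1 sign convention for the two branches are simplifications assumed for illustration.

    def entropy_net_structure(root):
        """Derive entropy-net layer sizes and sparse connectivity from a binary decision tree."""
        internal, leaves = [], []

        def walk(node, path):
            if isinstance(node, str):              # leaf: remember its root-to-leaf path
                leaves.append((node, path))
                return
            k = len(internal)
            internal.append(node)
            _feature, _threshold, left, right = node
            walk(left,  path + [(k, -1)])          # -1 marks the "feature < threshold" branch
            walk(right, path + [(k, +1)])          # +1 marks the "feature >= threshold" branch

        walk(root, [])
        n_partition = len(internal)                # one neuron per internal node
        n_and = len(leaves)                        # one neuron per leaf
        classes = sorted({label for label, _ in leaves})
        n_or = len(classes)                        # one neuron per distinct class or action

        # Each AND neuron connects only to the partitioning neurons on its path;
        # each OR neuron connects only to the AND neurons of its class's leaves.
        and_conn = [dict(path) for _, path in leaves]
        or_conn = {c: [j for j, (label, _) in enumerate(leaves) if label == c] for c in classes}
        return n_partition, n_and, n_or, and_conn, or_conn

    # Hypothetical tree: the root tests x0 < 0.5, and its right child tests x1 < 0.5.
    tree = (0, 0.5, "class A", (1, 0.5, "class B", "class A"))
    print(entropy_net_structure(tree))

The connectivity returned here is precisely what makes the mapped network sparse: an AND neuron ignores every internal node that does not lie on its own root-to-leaf path.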
The most important consequence of the tree-to-network mapping is that it defines exactly the number of neurons needed in each of the three layers of the neural network, as well as a way of specifying the desired response for each of these neurons, as I shall show later. Hitherto, this number of neurons has been determined by empirical means, and the credit assignment problem has been tackled with backpropagation. In comparison to the standard feedforward layered networks that are fully connected, the mapped network has far fewer connections. Except for one neuron in the partitioning layer that corresponds to the root node of the decision tree, the remaining neurons do not have connections with all the neurons in the adjacent layers. A smaller number of connections is an important advantage from the VLSI implementation point of view [11], given the present state of technology. To emphasize this difference in architecture, I shall henceforth refer to the mapped network as the entropy net, due to the mutual information-based, data-driven tree generation methodology that is discussed in the next section. Such a methodology is a must if the tree-to-network mapping is to be exploited.
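A rough count makes the saving visible. The helper below compares a fully connected three-layer baseline with the entropy-net wiring, under the assumptions that each partitioning neuron tests a single input feature and each leaf feeds exactly one OR neuron; the numbers are those of the toy structure derived in the previous sketch and are purely illustrative.

    def connection_counts(n_inputs, n_partition, n_and, n_or, and_conn):
        # Fully connected baseline with the same layer sizes.
        full = n_inputs * n_partition + n_partition * n_and + n_and * n_or
        # Entropy net: one input per partitioning neuron, path-only AND wiring,
        # and one OR connection per leaf.
        sparse = n_partition + sum(len(c) for c in and_conn) + n_and
        return full, sparse

    print(connection_counts(2, 2, 3, 2, [{0: -1}, {0: 1, 1: -1}, {0: 1, 1: 1}]))   # -> (16, 10)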
While the above mapping rules transform a sequential decision-making process into a parallel process, the resulting network has the same limitation that was exhibited by the MADALINE [7] type of early multilayer ANN models; there is no adaptability beyond the first hidden layer. In terms of decision trees, this limitation is best described by saying that once a wrong path is taken at an internal node, there is no way of recovering from the mistake. The layered networks of neurons avoid this pitfall because their adaptability beyond the partitioning layer allows some corrective actions after the first hidden layer. As I shall show later, it is possible to have the same adaptability in the entropy network as in the neural networks obtained through backpropagation training. This becomes possible by combining the dual concepts of soft decision making and incremental learning, which are described after the next section on the recursive tree design procedure.

III. MUTUAL INFORMATION AND RECURSIVE TREE DESIGN

Several automatic tree generation algorithms exist in the pattern recognition literature, where the problem of tree generation has been dealt with in two distinct ways. Some of the early approaches break the tree design process into two stages. The first stage yields a set of prototypes for each pattern class. These prototypes are viewed as entries in a decision table, which is later converted into a decision tree using some optimality criterion. Examples of this type of tree design approach can be found in [12], [13]. The problem of finding prototypes from binary or discrete-valued patterns is considered in [14], [15]. The other tree design approaches try to obtain the tree directly from the given set of labeled pattern vectors. These direct approaches can be considered as a generalization of the decision table conversion approaches, with all the available pattern vectors for the design forming the decision table entries. Examples of these direct tree design approaches can be found in [16]-[19].

There are three basic tasks that need to be solved during the tree design process: 1) defining the hierarchical ordering and the choice of the node decision functions, 2) deciding when to declare a node a terminal node, and 3) setting up a decision rule at each terminal node.



The last task is the easiest part of the tree design process. It is usually solved by following the majority rule. In its complete generality, the decision tree design problem is a difficult problem, and no optimal tree design procedure exists [20]. Some of the tree design difficulties are simplified in practice by enforcing a binary decision based on a single feature at each of the nonterminal nodes. This results in a partitioning of the decision space with hyperplanes that are orthogonal to the feature axes. While the use of a single-feature decision function at every nonterminal node reduces the computational burden at tree design time, it usually leads to larger trees.

One popular approach for ordering and locating the partitioning hyperplanes is based on defining a goodness measure of partitioning in terms of mutual information. Consider a two-class problem with only one measurement x. Let x = t define the partitioning of the one-dimensional feature space. If we view the measurement x taking on values greater or less than the threshold t as two outcomes x_1 and x_2 of an event X, then the amount of average mutual information obtained about the pattern classes from the observation of the event X can be written as

    I(C; X) = \sum_{i=1}^{2} \sum_{j=1}^{2} p(c_i, x_j) \log_2 [ p(c_i | x_j) / p(c_i) ]        (1)

where C represents the set of pattern classes and the p(.)'s are the various probabilities. Clearly, for better recognition, the choice of the threshold t should be such that we get as much information as possible from the event X. This means that the value of t which maximizes (1) should be selected over all possible values of t. Average mutual information gain (AMIG) thus provides a basis for measuring the goodness of a partitioning.
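As a concrete illustration of using (1) to place a threshold, the following Python sketch estimates the probabilities in (1) from labeled one-dimensional samples and scans a set of candidate thresholds; placing candidates at midpoints between sorted sample values, and the sample data themselves, are assumptions made for this example.

    import numpy as np

    def average_mutual_information(xs, labels, t):
        """Estimate I(C; X) of (1) for the two outcomes x < t and x >= t."""
        xs, labels = np.asarray(xs, dtype=float), np.asarray(labels)
        ami = 0.0
        for side in (xs < t, xs >= t):
            p_x = side.mean()
            if p_x == 0:
                continue
            for c in np.unique(labels):
                p_joint = np.mean(side & (labels == c))   # p(c_i, x_j)
                p_class = np.mean(labels == c)            # p(c_i)
                if p_joint > 0:
                    ami += p_joint * np.log2(p_joint / (p_x * p_class))
        return ami

    def best_threshold(xs, labels):
        values = np.sort(np.unique(np.asarray(xs, dtype=float)))
        candidates = (values[:-1] + values[1:]) / 2.0     # midpoints between sample values
        return max(candidates, key=lambda t: average_mutual_information(xs, labels, t))

    xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
    labels = np.array(["A", "A", "A", "B", "B", "B"])
    print(best_threshold(xs, labels))   # -> 0.5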
Another popular criterion for partitioning is the Gini index of diversity [19]. In this criterion, the impurity of a set of observations at a partitioning stage s is defined as

    i(s) = 1 - \sum_{j} p(c_j | s)^2

where p(c_j | s) denotes the conditional probability of class c_j at stage s. The further split in the data is made by selecting a partitioning that yields the greatest reduction in the average data impurity. The advantage of this criterion is its simpler arithmetic.

The above or similar partitioning measures immediately suggest a top-down recursive procedure for the tree design. The AMIG (average mutual information gain) algorithm [17] is one such example of a recursive tree design procedure that seeks to maximize the amount of mutual information gain at every stage of tree development. Unlike many other algorithms that operate only on discrete features or only on two classes, the AMIG algorithm is capable of generating decision trees for continuous-valued, multifeature, multiclass pattern recognition problems from a set of labeled pattern vectors. At any stage of tree development, the AMIG algorithm essentially employs a brute-force search to determine the best feature for that stage, along with its best threshold value, to define an event for the corresponding node. Since the orientation of the dividing hyperplanes is restricted, i.e., only one feature is used at any internal node, the search space for maximizing the average mutual information gain is small. The search is made efficient by ordering the labeled patterns along the different feature axes to obtain a small set of possible candidate locations along each axis. The AMIG algorithm or its variants have been used by numerous researchers to automatically design decision trees in problems such as character recognition, target recognition, etc. The differences among the various algorithms pertain to the stopping criterion.

The stopping criterion used in the AMIG algorithm is based on the following inequality [21], which determines the lower limit on the average mutual information to be provided by the tree for a specified error performance P_e:

    I_tree >= H(C) - H(P_e) - P_e \log_2 (M - 1)

where H(C) and H(P_e), respectively, represent the pattern class entropy and the error entropy, and M is the number of classes. The criterion used in [18] is to test the statistical significance of the mutual information gain that results from further splitting a node. Recently, Goodman and Smyth [22] have derived several fundamental bounds for mutual information-based recursive tree design procedures, and have suggested a new stopping criterion which is claimed to be more robust in the presence of noise.
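A rough sketch of such a mutual-information-driven recursive design loop is given below; it reuses the average_mutual_information and best_threshold helpers from the threshold-selection sketch above, and the stopping tests used here (node purity, a maximum depth, and a minimum information gain) are simple stand-ins for the criteria discussed above rather than the exact AMIG rule.

    import numpy as np

    def grow_tree(X, labels, min_gain=0.05, depth=0, max_depth=8):
        """Recursively split on the (feature, threshold) pair that maximizes (1)."""
        X, labels = np.asarray(X, dtype=float), np.asarray(labels)
        classes, counts = np.unique(labels, return_counts=True)
        majority = str(classes[np.argmax(counts)])
        if len(classes) == 1 or depth >= max_depth:
            return majority                                    # terminal node

        # Brute-force search over every feature and its candidate thresholds.
        f, t = max(((f, best_threshold(X[:, f], labels)) for f in range(X.shape[1])),
                   key=lambda ft: average_mutual_information(X[:, ft[0]], labels, ft[1]))
        if average_mutual_information(X[:, f], labels, t) < min_gain:
            return majority                                    # stop: too little information gained

        left, right = X[:, f] < t, X[:, f] >= t
        return (f, t, grow_tree(X[left], labels[left], min_gain, depth + 1, max_depth),
                      grow_tree(X[right], labels[right], min_gain, depth + 1, max_depth))

    X = [[0.1, 0.2], [0.2, 0.7], [0.8, 0.3], [0.9, 0.8]]       # toy labeled pattern vectors
    y = ["A", "A", "B", "B"]
    print(grow_tree(X, y))   # -> (0, 0.5, 'A', 'B')

The nested-tuple tree returned here uses the same representation as the mapping sketch of Section II, so its output can be passed directly to entropy_net_structure.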
Instead of using a stopping criterion to terminate the recursive partitioning, Breiman et al. [19] use a pruning approach with the Gini criterion to design decision trees. In their approach, the recursive partitioning continues until the tree becomes very large. This tree is then selectively pruned upwards to find the best subtree having the lowest error estimate. Trees obtained using pruning are typically less biased towards the design samples.

Summarizing the above discussion, it is clear that there exist several automatic tree generation procedures that are driven by the example pattern vectors. Any of these procedures, in conjunction with the decision tree-to-network mapping discussed earlier, can be used to design an entropy network for a given pattern recognition problem. While the design of layered networks through decision tree mapping eliminates the guesswork about the number of neurons in different layers and provides a direct method of obtaining connection strengths, the problem of adaptability of the entropy network beyond the partitioning layer still remains. The solution to this is discussed in the next section.

IV. INCREMENTAL LEARNING AND GENERALIZATION

Incremental learning implies modifying the existing knowledge in response to new data or facts. In the context of neural networks, it means the ability to modify the connection strengths or weights in response to the sequential presentation of input-output mapping examples. While the tree design phase can be viewed as an inductive learning phase, and even the tree design process can be made incremental in a limited sense [23], it is essential to have an incremental learning capability in the mapped networks. Such a capability is needed not only for the obvious reasons of adaptability and compatibility with networks designed through other approaches, but also for reducing the storage demands of the batch-oriented recursive tree design procedures. With an incremental learning capability in the entropy network, it is possible to divide the task of knowledge acquisition over the processes of tree building and mapped network training. Using only a small representative subset of the available input-output mapping examples, a decision tree can be designed without putting too much storage demand on the recursive tree generation phase. After the mapping has been done, the remaining examples can be used to further train the network in an incremental fashion.

To have the ability to modify weights in response to training examples, it is essential to solve the credit assignment problem for the intermediate layers, i.e., the partitioning layer and the ANDing layer. Fortunately, in the entropy network this problem is automatically solved during the tree design stage, when the different paths are assigned class labels. As can be noticed from the tree-to-network mapping, there exists a group of neurons for every pattern class in the ANDing layer of the network. The membership of this group is known from the tree-to-network mapping. Thus, given an example pattern from class c, it is known that only one neuron from group c of the ANDing layer should produce an output of "1," while the remaining neurons from that group, as well as those from the other groups, should produce a "0" response. Therefore, the solution to the credit assignment problem for the ANDing layer is very simple: enhance the response of the neuron producing the highest output among the neurons from group c, and suppress the response of the remaining neurons in the ANDing layer for a pattern from class c. This is similar to the winner-take-all approach followed for neural net training in the self-organizing mode [24]. The reason that this simple approach works in the entropy network is that the network has a built-in hierarchy of the decision tree, which is not present in the other layered networks. Once the identity of the firing neuron in the ANDing layer is established for a given example pattern, the desired response of the partitioning layer neurons is also established because of the tree-to-network mapping. Hence, the tree-to-network mapping not only provides the architecture of the multilayer net, but it also solves the credit assignment problem, thus giving rise to an incremental learning capability for the entropy network. Since the initial network configuration itself provides a reasonably good starting solution for the various network parameters, the incremental learning can be very fast, leading to drastically reduced training time on the whole.
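The following short Python sketch spells out this rule for the ANDing layer: within the group of AND neurons that belongs to the true class of the pattern, the neuron currently responding most strongly is given a desired output of 1, and every other AND neuron is given 0. The group bookkeeping mirrors the leaf-to-class grouping obtained from the tree-to-network mapping; the numbers are illustrative.

    def anding_layer_targets(responses, groups, true_class):
        """responses: current outputs of all AND neurons.
        groups: dict mapping each class label to the indices of its AND neurons.
        Returns the desired output of every AND neuron (winner-take-all within the true class)."""
        winner = max(groups[true_class], key=lambda j: responses[j])
        return [1.0 if j == winner else 0.0 for j in range(len(responses))]

    # Example: three AND neurons; class "A" owns neurons 0 and 2, class "B" owns neuron 1.
    groups = {"A": [0, 2], "B": [1]}
    print(anding_layer_targets([0.30, 0.55, 0.70], groups, true_class="A"))   # -> [0.0, 0.0, 1.0]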
One very important characteristic of learning, whether incremental or not, is that it should lead to a generalization capability on the part of the network. The amount of generalization achieved is reflected in the network response to those patterns that did not form part of the input-output mapping examples. To put it in other words, how well the network interpolates among the examples shown determines the degree of generalization achieved. In a typical neural net, the generalization is achieved by incorporating nonlinearities in the neurons. This is done by choosing a nonlinear activation function for the neuron model. While the choice of a particular nonlinearity is not crucial for learning a specific set of input-output mapping examples, it does determine the amount of generalization that the network will achieve [25]. The soft nonlinearities, such as the sigmoid function, provide much better generalization compared to the relay type of hard nonlinearities. An intuitive understanding of why the soft nonlinearities provide better generalization in a parallel environment like the multilayer neural networks can be had by saying that these types of nonlinearities allow the decision making to be delayed as far as possible, in the hope that at the later layers more information will be available to make a better decision. Hard nonlinearities, on the other hand, do not provide this privilege of postponing decisions, and consequently do not lead to much generalization in the network.

Since each node of the decision tree produces a binary decision, the activation function associated with the neurons in the entropy network can be considered a relay function. Obviously, it is not good for generalization, and must be replaced by the sigmoid or some other soft nonlinear function. The signal-level effect of having the sigmoid nonlinearity in place of the relay nonlinearity is that it changes all the internal signals from binary to analog. Consequently, small changes in the features do not affect the response of the neurons in the different layers as much as in the binary signal case, where a small change can result in a complete reversal of the signal state. This enhances the capability of the entropy network to deal with problems having noise and variability in the patterns, thus leading to better generalization. Another consequence of the sigmoid nonlinearity is that it altogether eliminates or minimizes the need for training the partitioning layer in the incremental learning mode. With the relay type of hard nonlinearity, it may not be possible to converge to the proper weights in the ANDing layer for the desired response if the threshold values in the partitioning layer, determined during the tree design phase, are not proper. In such cases, the partitioning layer training must adjust these threshold weights. However, with the sigmoid activation function, the thresholds in the partitioning layer can be off to a reasonable extent without affecting the convergence of the weight values for the ANDing layer. This is very important, as it eliminates altogether any reference to the previous layers during the learning process and provides a direct layer-by-layer progressive propagation method for learning.

Another way of looking at the use of the sigmoid function is that it allows the actual feature values to be carried across the different layers in a coded form, while the hard nonlinearities lose the actual feature value at the partitioning layer itself. In addition to the generality, or the better decision making, that results from carrying through the actual feature values in coded form, the other very important consequence of carrying through the coded information is that the final decision boundaries need not be piecewise linear, as would be the case with hard nonlinearities. Moreover, for the linear boundaries, the orientation need not be parallel to the different feature axes. Thus, while the partitioning layer neurons get information about single features only, the neurons in the successive layers do receive information on many features, thereby producing boundaries of the desired shape and orientation.

Based on the discussion thus far, the following steps are suggested for designing entropy nets for pattern recognition tasks.

• Divide the available set of input-output mapping examples into two parts: a tree design set and a network training set. This should be done when a large number of input-output examples is available. Otherwise, the complete set of examples should be used for both tree design and network training.
• Using AMIG or a similar recursive tree design procedure, develop a decision tree for the given problem.
• Map the tree into a three-layer neural net structure following the rules given earlier.
• Associate the sigmoid or some other soft nonlinearity with every neuron. Train the ANDing and ORing layers of the entropy network using the network training subset of the input-output mapping examples and the following procedure for determining the weight change.

Let x(p), with category label L(x(p)), be the input pattern to the entropy network at the pth presentation during training. Let R_j(x(p)) denote the response of the jth neuron from the ANDing/ORing layer. Let G(j) represent the group membership of the jth neuron, and w_ji the connection strength between the jth neuron and the ith neuron of the previous layer. Then the desired response of the jth neuron follows from its group membership G(j) and the label L(x(p)) according to the winner-take-all rule of the previous section, and the new weight values are obtained by adding the correction m_ji Δw_ji to the current weights, where the amount of change Δw_ji in the weights is determined by the Widrow-Hoff procedure [26], or the LMS rule as it is often called. The term m_ji is either "1" or "0," indicating whether or not a connection exists to the jth neuron from the ith neuron. It should be noted that the presence or absence of the connections is determined at the time of tree-to-network mapping. The suggested training procedure is such that it is possible to train each layer separately or simultaneously.
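A minimal sketch of one such masked training step, in Python with NumPy, is given below. The particular LMS form used (correction proportional to the output error times the input) and the way the connection mask and the winner-take-all targets enter the update follow the description above, but they are illustrative assumptions rather than a reproduction of the paper's exact equations.

    import numpy as np

    def sigmoid(v, alpha=1.0):
        return 1.0 / (1.0 + np.exp(-alpha * v))

    def lms_step(W, bias, mask, inputs, targets, rho=1.0, alpha=1.0):
        """One masked Widrow-Hoff (LMS) update for an ANDing or ORing layer.
        W and mask are (n_neurons, n_inputs); mask[j, i] = 1 only where the tree mapping put a link."""
        responses = sigmoid(W @ inputs - bias, alpha)
        error = targets - responses                        # desired minus actual response
        W += rho * mask * np.outer(error, inputs)          # update only the existing connections
        bias -= rho * error                                # thresholds move with the same error
        return responses

    # Toy ANDing layer: three neurons fed by two partitioning neurons, with a sparse mask.
    mask = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
    W = 0.1 * np.random.randn(3, 2) * mask
    bias = np.zeros(3)
    inputs = np.array([0.9, 0.2])                          # outputs of the partitioning layer
    targets = np.array([0.0, 0.0, 1.0])                    # winner-take-all desired response
    lms_step(W, bias, mask, inputs, targets, rho=1.0, alpha=2.0)

Because the targets for both trainable layers are fixed by the tree mapping, the same step can be applied to the ORing layer using the ANDing-layer responses as its inputs, which is what allows each layer to be trained separately.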
The above process of layered network design can be considered as a two-stage learning process. In the first stage, the major aspects of the given problem are learned by simultaneously considering a large number of input-output mapping examples. Next, the refinement of the learned knowledge, as well as its generalization, is achieved by looking at the same or additional examples in isolation.

V. DESIGN EXAMPLES

Following the above design approach for layered networks, I present in this section two design examples. The first example is for a two-category pattern recognition task with only two features. The second example is for a multicategory, multifeature pattern recognition task involving waveform classification. The purpose of the first example is to bring out the generalization capability of the entropy network. The second example is presented to demonstrate and compare the efficacy of the methodology for multicategory problems in a higher dimensional space.

There are three important parameters in the entropy net training. One of these is called the generalization constant α, which determines the generalization capability of the entropy network. The parameter α controls the linear part of the sigmoid nonlinearity. Fig. 4 shows several plots of the sigmoid function for different values of α. Since a very large α value brings the sigmoid nonlinearity very close to the relay nonlinearity, the amount of generalization provided by α is inversely proportional to its value. The second parameter ρ is called the learning factor, which may or may not remain fixed over the entire training. It controls the amount of correction that is applied to determine the new weight values. The third parameter ε specifies the termination of the training procedure. The training is terminated if none of the weight components differs from its previous value by an amount greater than ε.

Fig. 4. Sigmoid function 1/(1 + exp(-αx)) plots for different α values.
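A small numerical illustration of how α trades softness for sharpness (the sample points are arbitrary):

    import numpy as np

    def sigmoid(x, alpha):
        # The activation of Fig. 4: alpha sets the width of the soft, nearly linear region.
        return 1.0 / (1.0 + np.exp(-alpha * x))

    for alpha in (1.0, 5.0, 20.0):
        print(alpha, np.round(sigmoid(np.array([-0.2, -0.05, 0.05, 0.2]), alpha), 3))
    # Small alpha keeps the outputs graded near the threshold; large alpha approaches a relay.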

The first design example is an analog version of the EX-OR problem. Fig. 5(a) shows two pattern classes in the form of two different tones in a two-dimensional feature space, with the corresponding decision tree in Fig. 5(b). The tonal boundary is the decision boundary that the neural net is expected to learn. The entropy network mapping the tree of Fig. 5(b) is shown in Fig. 6. The threshold values of all the internal nodes of the tree of Fig. 5(b) were intentionally offset by an amount of 0.05 in the mapping process to determine the adaptability of the entropy network. Using a uniform random number generator, 60 input-output mapping examples for this problem were generated. The dots in Fig. 5(a) represent these randomly generated input-output mapping examples that were used to train the entropy network.

Fig. 5. (a) An analog version of the EX-OR problem in a two-dimensional space. Two different tones represent the two different class regions that a neural network is expected to learn. The dots correspond to the input-output examples used for network training. (b) Decision tree for the analog EX-OR problem.

Fig. 6. Entropy network for the decision tree of Fig. 5(b). Tones in the network represent the class responsible for the corresponding neuron firing.

While it is possible to train the ANDing and ORing layers simultaneously, each layer was trained separately to determine the progressive generalization capability of the entropy network. The parameters ρ and ε were set to 1.0 and 0.001, respectively, and different values of the generalization parameter α were used. The learning factor was made to decrease in inverse proportion to the iteration number.

Results for two cases of training are shown in Figs. 7 and 8. The upper left image in each of these figures represents the decision boundary as learned by the net at its output layer. The lower left image shows the difference between the actual decision boundary and the learned boundary. The right-hand column represents the same for the ANDing layer. The decision boundary at the partitioning layer, of course, corresponds to the boundary represented by the decision tree. Fig. 7 represents the generalization performed by the net for an α value of 20.0, while Fig. 8 corresponds to a value of 10.0. Since the threshold values in the decision tree were offset by only a small amount, the generalization needed is not large. This explains the better learning exhibited in Fig. 7 compared to Fig. 8. This experiment has been repeated many times with different seed values for the random generator program. In all of the cases, the results obtained were almost similar to the above. The number of iterations for the ANDing layer averaged about 37. The corresponding number for the ORing layer is about 156. The notable feature of the entire training procedure was that only one neuron in the ANDing layer always responded for the pattern class represented by the lighter shade, although two neurons were put in the ANDing layer for each class as a result of the tree mapping. This indicates that the entropy network does not just mimic the decision tree, but has its own generalization capability.

Fig. 7. Mapping learned by the entropy net for α = 20. The upper left image is the mapping learned at the output layer. The upper right image is the mapping learned at the ANDing layer. The lower left and right images show the difference between the desired mapping and the learned mapping.

Fig. 8. Same as in Fig. 7, except α = 10.

In order to further test the generalization capability of the network, the class labels of the training examples were changed to correspond to the decision boundary of Fig. 9, and the same network was trained with these modified examples. Training results for this case are shown in Figs. 10 and 11 for two values of α, 1 and 5, respectively. Since there is a large difference between the decision tree boundary of Fig. 5(b) and the boundary of Fig. 9, a large amount of generalization is needed in this particular case to let the entropy net adapt, as is evident in Figs. 10 and 11. It is important to note here that the final decision boundary in this case has an orientation other than horizontal or vertical. This indicates that while the decision tree, designed with a single feature per node, has constraints on the orientation of the decision boundary, the entropy network has no such limitation, due to the sigmoid function. While examining the output of the ANDing layer neurons, once again the buildup of the internal representation by the entropy network was observed; only one neuron per class was found to participate in the learning process. The remaining two neurons, one from each category, were found in the nonfiring state all the time. The pairing of firing and nonfiring neurons was not fixed; it was found to depend on the starting random weight values.

Fig. 9. Another decision boundary for learning by the entropy net of Fig. 6.

Fig. 10. Mapping learned by the entropy net with the modified input-output examples; α = 1.

Fig. 11. Same as Fig. 10, except α = 5.


The average number of iterations for the ANDing layer over the different runs was 193. The corresponding number for the ORing layer was 42. The larger number of iterations for the ANDing layer in this case is due to the great difference between the starting boundary, which corresponds to the tree of Fig. 5(b), and the desired boundary of Fig. 9.

The second example uses a synthetic data set to simulate a well-known waveform recognition problem from the classification tree literature [19]. This was done to compare the classification performance of the entropy net with that of several other classifiers. The WAVE data consist of 21-dimensional continuous-valued feature vectors coming from three classes with equal a priori probability. Each class of data is generated by combining, with noise, two of the three waveforms of Fig. 12 at 21 sampled positions (see [19] for details). The training data set consists of 300 examples that were used to design the tree. The same set of examples was then used to train the entropy net. The test data have 5000 vectors. Using the training vectors, the AMIG algorithm produced the decision tree of Fig. 13 for waveform classification. The two numbers within each internal node of the tree respectively represent the feature axis and the threshold value on that axis. Only 8 of the 21 features are present in the tree. This indicates that the tree has already acquired the ability to discriminate between the important and unimportant features of the problem as far as the classification task is concerned.

Fig. 12. Three basic functions used to generate the waveform data.

Fig. 13. Decision tree for the waveform data using the AMIG algorithm. The first number at each internal node represents a feature axis, and the second number corresponds to a threshold value on that axis.

To determine the learning progress of the entropy net, it was decided to perform classification on the test data after every ten iterations of weight adjustment with the training data and to use the error rate on the test data as a measure of learning. Both of the layers were trained simultaneously. The initial choice for the weights was made randomly. The training procedure was repeated many times with different initial weight values. No significant differences, either in terms of the number of iterations or the error rate, were noticed due to the initial weight selection. In all of the cases, stable classification performance was attained within 40 iterations. The best classification performance was obtained for an α value of 2.0. Next, a number of classification experiments were performed to gauge the effectiveness of the entropy net. These experiments include entropy net training using the backpropagation program of the PDP software [27]. In this case, a learning rate of 0.5 was found unsuitable in terms of the number of iterations and classification performance. However, using a learning rate of 0.1 resulted in stable performance within 40 iterations. Another entropy net, mapping the decision tree given in [19] for the same problem, was also realized and trained.

Fig. 14 summarizes the classification performance of several classifiers and the entropy net for the different cases. These classifiers include the decision tree of Fig. 13, the decision tree from [19], and a nearest neighbor classifier that uses the training set as its database. It is seen that the entropy net, whether trained using the LMS rule or backpropagation, provides an improvement over the tree classifier performance because of the adaptability arising from the use of the sigmoid nonlinearity. In all three realizations of the entropy net, the performance is either better than the nearest neighbor performance or almost the same. The relative performance levels attained by the different classifiers are similar to those of other studies for different classification tasks [8]. The shorter training time for the entropy net using the LMS rule or backpropagation also confirms the findings of other researchers that matching the network structure to the problem leads to less training time [1], [8].

Fig. 14. Error rate for the different classifiers for the waveform recognition problem (Entropy-CART, Entropy-BP, Entropy-AMIG, 1-NN, CART tree, and AMIG tree). Entropy-BP represents the results when the entropy net for the AMIG tree was trained using backpropagation.

VI. SUMMARY AND CONCLUSIONS

A new neural network design methodology has been presented in this paper. This methodology has been developed by exploiting the similarities between the hierarchical classifiers of the traditional pattern recognition literature and multiple-layer neural networks. It has been shown that decision trees can be restructured as three-layer neural networks, called entropy networks. The entropy network architecture has the advantage of relatively fewer neural connections, which is an attractive feature from the VLSI fabrication point of view.


Since it is possible to automatically generate decision trees using data-driven procedures, the tree-to-layered-network mapping rules provide a systematic tool to obtain the layered network architecture. One very important property of the entropy network architecture is that the problem of credit assignment does not exist for these networks, as it is automatically solved during the tree learning stage. The issues of incremental learning and generalization have been discussed, and the importance of soft nonlinearities has been stressed. Finally, a two-stage procedure has been given in which the dominant aspects of the problem are learned during the tree development phase and the generalization of the learned knowledge takes place via the entropy network training. The effectiveness of the proposed methodology has been shown through two examples.

It needs to be mentioned that the tree-to-network mapping approach is not without flaws. Because of the use of a single feature at each node during the tree design, it is possible in many cases, the EX-OR problem for one, to end up with very large trees. One possible solution to avoid very large trees is to apply the Hotelling or principal component transformation [28] to the data first. In terms of the entropy network, it is equivalent to adding an extra representation stage, giving rise to a network which can be appropriately called a Hotelling-entropy net. Such networks are currently under study, along with a study on the limitations of partially connected networks, in comparison to fully connected feedforward networks, with respect to missing data and broken connections.

ACKNOWLEDGMENT

I gratefully acknowledge the assistance of N. Ramesh, M. Otten, and G. Yu in running some experiments for me. I also thank Prof. A. Jain for many useful discussions.

REFERENCES

[1] D. J. Burr, "Experiments on neural net recognition of spoken and written text," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 1162-1168, July 1988.
[2] R. P. Lippmann, "An introduction to computing with neural nets," IEEE ASSP Mag., pp. 4-22, Apr. 1987.
[3] A. Wieland and R. Leighton, "Geometric analysis of neural network capabilities," in Proc. IEEE Int. Conf. Neural Networks, Vol. III, San Diego, CA, June 1987, pp. 385-392.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in D. E. Rumelhart and J. L. McClelland, Eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. Cambridge, MA: M.I.T. Press, 1986.
[5] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive Sci., vol. 9, pp. 147-169, 1985.
[6] R. Hecht-Nielsen, "Counterpropagation networks," Appl. Opt., vol. 26, pp. 4979-4984, Dec. 1987.
[7] B. Widrow, R. G. Winter, and R. A. Baxter, "Layered neural nets for pattern recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 1109-1118, July 1988.
[8] W. Y. Huang and R. P. Lippmann, "Comparison between neural net and conventional classifiers," in Proc. IEEE 1st Int. Conf. Neural Networks, Vol. IV, San Diego, CA, June 1987, pp. 485-493.
[9] L. N. Kanal, "Problem-solving models and search strategies for pattern recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-1, pp. 194-201, Apr. 1979.
[10] H. J. Payne and W. S. Meisel, "An algorithm for constructing optimal binary decision trees," IEEE Trans. Comput., vol. C-25, pp. 905-916, Sept. 1977.
[11] L. A. Akers, M. R. Walker, D. K. Ferry, and R. O. Grondin, "Limited interconnectivity in synthetic neural systems," in R. Eckmiller and C. v.d. Malsburg, Eds., Neural Computers. New York: Springer-Verlag, 1988.
[12] C. R. P. Hartmann, P. K. Varshney, K. G. Mehrotra, and C. L. Gerberich, "Application of information theory to the construction of efficient decision trees," IEEE Trans. Inform. Theory, vol. IT-28, pp. 565-577, July 1982.
[13] I. K. Sethi and B. Chatterjee, "Efficient decision tree design for discrete variable pattern recognition problems," Pattern Recognition, vol. 9, pp. 197-206, 1978.
[14] I. K. Sethi and B. Chatterjee, "A learning classifier scheme for discrete variable pattern recognition problems," IEEE Trans. Syst., Man, Cybern., vol. SMC-8, pp. 49-52, Jan. 1978.
[15] J. C. Stoffel, "A classifier design technique for discrete variable pattern recognition problems," IEEE Trans. Comput., vol. C-23, pp. 428-441, 1974.
[16] E. G. Henrichon and K. S. Fu, "A nonparametric partitioning procedure for pattern classification," IEEE Trans. Comput., vol. C-18, pp. 614-624, July 1969.
[17] I. K. Sethi and G. P. R. Sarvarayudu, "Hierarchical classifier design using mutual information," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-4, pp. 441-445, July 1982.
[18] J. L. Talmon, "A multiclass nonparametric partitioning algorithm," in E. S. Gelsema and L. N. Kanal, Eds., Pattern Recognition in Practice II. Amsterdam: Elsevier Science Pub. B.V. (North-Holland), 1986.
[19] L. Breiman, J. Friedman, R. Olshen, and C. J. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth Int. Group, 1984.
[20] L. Hyafil and R. L. Rivest, "Constructing optimal binary decision trees is NP-complete," Inform. Processing Lett., vol. 5, pp. 15-17, 1976.
[21] R. M. Fano, Transmission of Information. New York: Wiley, 1963.
[22] R. M. Goodman and P. Smyth, "Decision tree design from a communication theory standpoint," IEEE Trans. Inform. Theory, vol. 34, pp. 979-994, Sept. 1988.
[23] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[24] T. Kohonen, Self-Organization and Associative Memory. Berlin: Springer-Verlag, 1984.
[25] C. J. Matheus and W. E. Hohensee, "Learning in artificial neural systems," Univ. of Illinois, Urbana, Tech. Rep. TR-87-1394, 1987.
[26] B. Widrow and M. E. Hoff, "Adaptive switching circuits," in 1960 IRE WESCON Conv. Rec., part 4, 1960, pp. 96-104.
[27] J. L. McClelland and D. E. Rumelhart, Explorations in Parallel Distributed Processing. Cambridge, MA: M.I.T. Press, 1988.
[28] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.

Ishwar K. Sethi (Senior Member, IEEE) received the B.Tech. (Hons.), M.Tech., and Ph.D. degrees in electronics and electrical communication engineering from the Indian Institute of Technology, Kharagpur, India, in 1969, 1971, and 1977, respectively.

He is presently an Associate Professor of Computer Science at Wayne State University, Detroit, Michigan. His current research interests are in the applications of artificial neural networks to solve computer vision and pattern recognition problems. Prior to joining Wayne State University, he was on the faculty at the Indian Institute of Technology, Kharagpur, India. During the latter half of 1988, he was a Visiting Professor at the Indian Institute of Technology, New Delhi, India.

Dr. Sethi is a member of the Association for Computing Machinery, the International Neural Network Society, and the International Society for Optical Engineering. He is currently an Associate Editor of Pattern Recognition.
