A multiple-layer artificial neural network (ANN) structure is capable of implementing arbitrary input-output mappings. Similarly, hierarchical classifiers, more commonly known as decision trees, possess the capability of generating arbitrarily complex decision boundaries in an n-dimensional space. Given a decision tree, it is possible to restructure it as a multilayered neural network. The objective of this paper is to show how this mapping of decision trees into a multilayer neural network structure can be exploited for the systematic design of a class of layered neural networks, called entropy nets, that have far fewer connections. Several important issues such as automatic tree generation, incorporation of incremental learning, and the generalization of knowledge acquired during the tree design phase are discussed. Finally, a two-step methodology for designing entropy networks is presented. The advantage of this methodology is that it specifies the number of neurons needed in each layer, along with the desired output. This leads to a faster progressive training procedure that allows each layer to be trained separately. Two examples are presented to show the success of neural network design through decision tree mapping.

I. INTRODUCTION

A pattern recognition system begins with the observation of input data. The input data representing a pattern are called the measurement or feature vector. The function performed by a pattern recognition system is the mapping of the input feature vector into one of the various decision classes. The mapping performed by a pattern recognition system can be represented in many cases by writing the equations of the decision boundaries in the feature space. A linear input-output mapping realized by a particular pattern recognition system then implies that the decision boundaries have a linear form. However, most of the pattern recognition problems of practical interest need a nonlinear mapping between the input and output. Although a single neuron is capable of only a linear mapping, a layered network of neurons with multiple hidden layers provides any desired mapping. It is this capability of the layered networks that has resulted in the renewed interest in the ANN field.

An example of a multiple hidden layer network is shown in Fig. 1(a). Generally, all neurons in a layer are connected to the neurons in the adjacent layers through weighted connections. In the context of pattern recognition, such layered networks are also called multilayer perceptron (MLP) networks. It can be easily shown that two hidden layers are sufficient to form piecewise linear decision boundaries of any complexity [1], [2]. However, it must be noted that two layers are not necessary for arbitrary decision regions [3]. The first hidden layer is the partitioning layer that divides the entire feature space into several regions. The second hidden layer is the ANDing layer that performs ANDing of the partitioned regions to yield convex decision regions for each class. The output layer can be considered as the ORing layer that logically combines the results of the previous layer to produce disjoint decision regions of arbitrary shape, with holes and concavities if needed.

The most common training paradigm for layered networks is supervised learning. In this mode of learning, the network is presented with examples of input-output mapping pairs. During the learning process, the network continuously modifies its connection strengths or weights to achieve the mapping present in the examples. While single-layer neuron training procedures have been around for 30-40 years, the extension of these training procedures to multilayer neuron networks proved to be a difficult task because of the so-called credit assignment problem, i.e., what should be the desired output of the neurons in the hidden layers during the training phase?

One of the solutions to the credit assignment problem that has gained prominence is to propagate back the error at the output layer to the internal layers. The resulting backpropagation algorithm [4] is the most frequently used training procedure for layered networks. It is a gradient descent procedure that minimizes the error at the output layer. Although the convergence of the algorithm has been proved only under the assumption of infinitely small weight changes, practical implementations with larger weight changes appear to yield convergence most of the time. Because of the use of the gradient search procedure, the backpropagation algorithm occasionally leads to solutions that represent local minima. Recently, many variations of the backpropagation algorithm have been proposed to speed up the network learning time. Some other examples of layered network training procedures are Boltzmann learning [5], counterpropagation [6], and Madaline Rule-II [7]. These training procedures, including the backpropagation algorithm, however, are generally slow. Additionally, these training procedures do not specify in any way the number of neurons needed in the hidden layers. This number is an important parameter that can significantly affect the learning rate as well as the overall classification performance, as indicated by the experimental studies of several researchers [1], [8].

There exists a class of conventional pattern classifiers that has many similarities with the layered networks. This class of classifiers is called hierarchical or decision tree classifiers. As the name implies, these classifiers arrive at a decision through a hierarchy of stages. Unlike many conventional pattern recognition techniques, the decision tree classifiers do not impose any restriction on the underlying distribution of the input data. These classifiers are capable of producing arbitrarily complex decision boundaries that can be learned from a set of training vectors. While the tree-based learning is noniterative or single step, the neural net learning is incremental. The incremental learning mode is more akin to human learning, where the hypotheses are continually refined in response to more and more training examples. However, the advantage of single-step learning is that all the training examples are considered simultaneously to form hypotheses, thus leading to faster learning. The two significant differences between the decision tree classifiers and the layered networks are: 1) the sequential nature of the tree classifiers as opposed to the massive parallelism of the neural networks, and 2) the limited generalization capabilities of the tree classifiers as the learning mechanism in comparison to layered networks. The aim of this paper is to show how the similarities between the tree classifiers and the layered networks can be used for developing a pragmatic approach to the design and training of neural networks for classification. The motivation for this work is to provide a systematic layered network design methodology that has built-in solutions to the credit assignment problem and the network topology.

The organization of the rest of the paper is as follows. In Section II, I introduce decision trees and their mapping in the form of layered networks. A recursive tree design procedure is introduced in Section III to acquire knowledge from the input data. Section IV discusses the issues related to incremental learning and the generalization of the captured knowledge. This is followed by a two-step layered network design procedure that allows the training of all the layers by progressively propagating the acquired knowledge. In Section V, I present experimental results that demonstrate the speed of learning, as well as the extent of generalization possible through the present design approach. Section VI contains the summary of the paper.

II. DECISION TREE CLASSIFIERS

Decision trees offer a structured way of decision making in pattern recognition. The rationale for decision tree-based partitioning of the decision space has been well summarized by Kanal [9]. A decision tree is characterized by an ordered set of nodes. Each of the internal nodes is associated with a decision function of one or more features. The terminal nodes or leaf nodes of the decision tree are associated with actions or decisions that the system is expected to make. In an m-ary decision tree, there are m descendants for every node. The binary decision tree is the most commonly used tree form, and an equivalent binary tree exists for any m-ary decision tree. Henceforth in this paper, a decision tree will imply a binary tree.

A decision tree induces a hierarchical partitioning over the decision space. Starting with the root node, each of the successive internal nodes partitions its associated decision region into two half spaces, with the node decision function defining the dividing hyperplane. An example of a decision tree and the corresponding hierarchical partitioning induced by the tree are shown in Fig. 2. It is easy to see that as the depth of the tree increases, the resulting partitioning becomes more and more complex.

Classification using a decision tree is performed by traversing the tree from the root node to one of the leaf nodes using the unknown pattern vector. The response elicited by the unknown pattern is the class or decision label attached to the leaf node that is reached by the unknown vector. It is obvious that all the conditions along any particular path from the root to the leaf node of the decision tree must be satisfied in order to reach that particular leaf

1606 PROCEEDINGS OF THE IEEE, VOL. 78, NO. 10, OCTOBER 1990
Fig. 2. (a) An example of a decision tree. Square boxes represent terminal nodes. (b) Hierarchical partitioning of the two-dimensional space induced by the decision tree of (a).

node. Thus, each path of a decision tree implements an AND operation on a set of half spaces. If two or more leaf nodes result in the same action or decision, then the corresponding paths are in an OR relationship. Since a layered neural network for classification also implements ANDing of hyperplanes followed by ORing in the output layer, it is obvious that a decision tree and a layered network are equivalent in terms of input-output mapping. Not only that, a decision tree can be restructured as a layered network by following certain rules. These rules can be informally stated as follows.

- The number of neurons in the first layer of the layered network equals the number of internal nodes of the decision tree. Each of these neurons implements one of the decision functions of the internal nodes. This layer is the partitioning layer.
- All leaf nodes have a corresponding neuron in the second hidden layer, where the ANDing is implemented. This layer is the ANDing layer.
- The number of neurons in the output layer equals the number of distinct classes or actions. This layer implements the ORing of those tree paths that lead to the same action.
- The connections between the neurons from the partitioning layer and the neurons from the ANDing layer implement the hierarchy of the tree.

An example of tree restructuring following the above rules is shown in Fig. 3 for the decision tree of Fig. 2.

Fig. 3. Three-layered mapped network for the decision tree of Fig. 2(a).

As this example shows, it is fairly straightforward to map a decision tree into a layered network of neurons. It should be noted that the mapping rules given above do not attempt to optimize the number of neurons in the partitioning layer. However, a better mapping can be achieved by incorporating checks in the mapping rules for replications of the node decision functions in different parts of the tree, to avoid the duplication of neurons in the partitioning layer. It can be further enhanced by using algorithms [10] that produce an optimal tree from the partitioning specified by a given decision tree.

The most important consequence of the tree-to-network mapping is that it defines exactly the number of neurons needed in each of the three layers of the neural network, as well as a way of specifying the desired response for each of these neurons, as I shall show later. Hitherto, this number of neurons has been determined by empirical means, and the credit assignment problem has been tackled with backpropagation. In comparison to the standard feedforward layered networks that are fully connected, the mapped network has far fewer connections. Except for one neuron in the partitioning layer that corresponds to the root node of the decision tree, the remaining neurons do not have connections with all the neurons in the adjacent layers. A smaller number of connections is an important advantage from the VLSI implementation point of view [11], given the present state of technology. To emphasize this difference in the architecture, I shall henceforth refer to the mapped network as the entropy net, due to the mutual information-based, data-driven tree generation methodology that is discussed in the next section. Such a methodology is a must if the tree-to-network mapping is to be exploited.

While the above mapping rules transform a sequential decision making process into a parallel process, the resulting network, however, has the same limitations that were exhibited by the MADALINE [7] type of early multilayer ANN models; there is no adaptability beyond the first hidden layer. In terms of the decision trees, this limitation is best described by saying that once a wrong path is taken at an internal node, there is no way of recovering from the mistake. The layered networks of neurons avoid this pitfall because of their adaptability beyond the partitioning layer, which allows some corrective actions after the first hidden layer. As I shall show later, it is possible to have the same adaptability capabilities in the entropy network as those of neural networks obtained through backpropagation training. This becomes possible by combining the dual concepts of soft decision making and incremental learning, which are described after the next section on the recursive tree design procedure.

III. MUTUAL INFORMATION AND RECURSIVE TREE DESIGN

Several automatic tree generation algorithms exist in the pattern recognition literature, where the problem of tree generation has been dealt with in two distinct ways. Some of the early approaches break the tree design process into two stages. The first stage yields a set of prototypes for each pattern class. These prototypes are viewed as entries in a decision table, which is later converted into a decision tree using some optimality criterion. Examples of this type of tree design approach can be found in [12], [13]. The problem of finding prototypes from binary or discrete-valued patterns is considered in [14], [15]. The other tree design approaches try to obtain the tree directly from the given set of labeled pattern vectors. These direct approaches can be considered as a generalization of the decision table conversion approaches, with all the available pattern vectors for the design forming the decision table entries. Examples of these direct tree design approaches can be found in [16]-[19].

There are three basic tasks that need to be solved during the tree design process: 1) defining the hierarchical ordering and choice of the node decision functions, 2) deciding when to declare a node as a terminal node, and 3) setting up a decision rule at each terminal node. The last task is the
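The recursive design of Section III rests on choosing, at each node, a single-feature threshold that maximizes the average mutual information gain between the binary split and the class labels. The AMIG procedure itself is not reproduced in this excerpt, so the following is only a minimal sketch of an information-gain split search; the function names (`entropy`, `information_gain`, `best_split`) and the use of midpoints between sorted feature values as candidate thresholds are illustrative assumptions, not the paper's exact algorithm.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(labels, values, threshold):
    """Average mutual information between the binary split
    (value < threshold) and the class labels: parent entropy
    minus the weighted entropy of the two child nodes."""
    left = [l for l, v in zip(labels, values) if v < threshold]
    right = [l for l, v in zip(labels, values) if v >= threshold]
    classes = sorted(set(labels))
    counts = lambda subset: [subset.count(c) for c in classes]
    n = len(labels)
    h_parent = entropy(counts(labels))
    h_children = (len(left) / n) * entropy(counts(left)) \
               + (len(right) / n) * entropy(counts(right))
    return h_parent - h_children

def best_split(labels, values):
    """Pick the candidate threshold (midpoint between adjacent
    sorted feature values) with the maximal information gain."""
    vals = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(vals, vals[1:])]
    return max(candidates, key=lambda t: information_gain(labels, values, t))
```

For a toy one-feature problem with labels [0, 0, 1, 1] at feature values [1.0, 2.0, 10.0, 11.0], `best_split` returns the threshold 6.0, whose split separates the classes perfectly and therefore attains the full one bit of gain. Applied recursively to each resulting region, this kind of criterion yields the data-driven trees that give the entropy net its name.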
To have the ability to modify weights in response to training examples, it is essential to solve the credit assignment problem for the intermediate layers, i.e., the partitioning layer and the ANDing layer. Fortunately, in the entropy network, this problem is automatically solved during the tree design stage when the different paths are assigned class labels. As can be noticed from the tree-to-network mapping, there exists a group of neurons for every pattern class in the ANDing layer of the network. The membership in this group is known from the tree-to-network mapping. Thus, given an example pattern from class c, it is known that only one neuron from group c of the ANDing layer should produce an output of "1," while the remaining neurons from that group, as well as those from the other groups, should produce a "0" response. Therefore, the solution to the credit assignment problem for the ANDing layer is very simple: enhance the response of the neuron producing the highest output among the neurons from group c, and suppress the response of the remaining neurons in the ANDing layer for a pattern from class c. This is similar to the winner-take-all approach followed for neural net training in the self-organizing mode [24]. The reason that this simple approach works in the entropy network is that the network has a built-in hierarchy of the decision tree, which is not present in the other layered networks. Once the identity of the firing neuron in the ANDing layer is established for a given example pattern, the desired response from the partitioning layer neurons is also established because of the tree-to-network mapping. Hence, the tree-to-network mapping not only provides the architecture of the multilayer net, but it also solves the credit assignment problem, thus giving rise to an incremental learning capability for the entropy network. Since the initial network configuration itself provides a reasonably good starting solution to the various network parameters, the incremental learning can be very fast, leading to drastically reduced training time on the whole.

One very important characteristic of learning, whether incremental or not, is that it should lead to generalization capability on the part of the network. The amount of generalization achieved is reflected in the network response to those patterns that did not form part of the input-output mapping examples. In other words, how well the network interpolates among the examples shown determines the degree of generalization achieved. In a typical neural net, generalization is achieved by incorporating nonlinearities in the neurons. This is done by choosing a nonlinear activation function for the neuron model. While the choice of a particular nonlinearity is not crucial for learning a specific set of input-output mapping examples, it does determine the amount of generalization that the network will achieve [25]. Soft nonlinearities, such as the sigmoid function, provide much better generalization compared to the relay type of hard nonlinearities. An intuitive understanding of why the soft nonlinearities provide better generalization in a parallel environment like the multilayer neural networks can be had by saying that these types of nonlinearities allow the decision making to be delayed as far as possible, in the hope that at the later layers more information will be available to make a better decision. Hard nonlinearities, on the other hand, do not provide this privilege of postponing decisions, and consequently do not lead to much generalization in the network.

Since each node of the decision tree produces a binary decision, the activation function associated with the neurons in the entropy network can be considered as a relay function. Obviously, it is not good for generalization, and must be replaced by a sigmoid or some other soft nonlinear function. The signal-level effect of having a sigmoid nonlinearity in place of the relay nonlinearity is that it changes all the internal signals from binary to analog. Consequently, small changes in the features do not affect the response of the neurons in different layers as much as in the binary signal case, where a small change can result in a complete reversal of the signal state. This enhances the capability of the entropy network to deal with problems having noise and variability in the patterns, thus leading to better generalization. Another consequence of the sigmoid nonlinearity is that it altogether eliminates or minimizes the need for training the partitioning layer in the incremental learning mode. With the relay type of hard nonlinearity, it may not be possible to converge to the proper weights in the ANDing layer for the desired response if the threshold values in the partitioning layer, determined during the tree design phase, are not proper. In such cases, the partitioning layer training must adjust these threshold weights. However, with the sigmoid activation function, the thresholds in the partitioning layer can be off to a reasonable extent without affecting the convergence of the weight values for the ANDing layer. This is very important, as it eliminates altogether any reference to the previous layers during the learning process and provides a direct layer-by-layer progressive propagation method for learning.

Another way of looking at the use of a sigmoid function is that it allows the actual feature values to be carried across different layers in a coded form, while the hard nonlinearities lose the actual feature value at the partitioning layer itself. In addition to the generality or the better decision making that results from carrying through the actual feature values in coded form, the other very important consequence of carrying through the coded information is that the final decision boundaries need not be piecewise linear, as would be the case with hard nonlinearities. Moreover, for the linear boundaries, the orientation need not be parallel to the different feature axes. Thus, while the partitioning layer neurons get information about single features only, the neurons in the successive layers do receive information on many features, thereby producing boundaries of desired shape and orientation.

Based on the discussion thus far, the following steps are suggested for designing entropy nets for pattern recognition tasks.

- Divide the available set of input-output mapping examples into two parts: a tree design set and a network training set. This should be done when a large number of input-output examples is available. Otherwise, the complete set of examples should be used for both tree design and network training.
- Using AMIG or a similar recursive tree design procedure, develop a decision tree for the given problem.
- Map the tree into a three-layer neural net structure following the rules given earlier.
- Associate the sigmoid or some other soft nonlinearity with every neuron. Train the ANDing and ORing layers of the entropy network using the network training subset of the input-output mapping examples and the following procedure for determining the weight change.
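The mapping step in the procedure above can be made concrete with a small sketch. The `Node` record, `classify`, and `entropy_net_sizes` below are hypothetical names introduced for illustration (internal nodes test `feature < threshold`; leaves carry a class label); the layer sizes follow the informal rules of Section II: one partitioning neuron per internal node, one ANDing neuron per leaf, and one ORing neuron per distinct class.

```python
class Node:
    """Hypothetical binary decision-tree node: internal nodes hold a
    (feature, threshold) test; leaf nodes hold only a class label."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, label=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.label = left, right, label

def classify(root, x):
    """Sequential tree traversal: follow the satisfied half-space
    tests from the root to a leaf and return the leaf's label."""
    n = root
    while n.label is None:
        n = n.left if x[n.feature] < n.threshold else n.right
    return n.label

def entropy_net_sizes(root):
    """Mapping rules: partitioning neurons = internal nodes,
    ANDing neurons = leaves, ORing neurons = distinct classes."""
    internal, leaves, classes = 0, 0, set()
    stack = [root]
    while stack:
        n = stack.pop()
        if n.label is not None:       # leaf node
            leaves += 1
            classes.add(n.label)
        else:                         # internal node
            internal += 1
            stack.extend([n.left, n.right])
    return internal, leaves, len(classes)

# A tree for an EX-OR-like two-feature problem: 3 internal nodes
# and 4 leaves over 2 classes map to a 3/4/2 entropy net.
tree = Node(0, 0.5,
            left=Node(1, 0.5, left=Node(label="A"), right=Node(label="B")),
            right=Node(1, 0.5, left=Node(label="B"), right=Node(label="A")))
```

Here `entropy_net_sizes(tree)` yields (3, 4, 2). Each root-to-leaf path ANDs its half-space tests, and the two leaves labeled "A" are ORed by a single output neuron, which is exactly the parallel structure the mapping rules describe.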
Let x(p) with category label L(x(p)) be the input pattern presented to the entropy network at the pth presentation during training. Let R_j(x(p)) denote the response of the jth neuron from the ANDing/ORing layer. Let G(j) represent the group membership of the jth neuron, and w_ji the connection strength between the jth neuron and the ith neuron of the previous layer. Then

V. DESIGN EXAMPLES

Following the above design approach for the layered networks, I present in this section two design examples. The first example is for a two-category pattern recognition task with only two features. The second example is for a multicategory, multifeature pattern recognition task involving waveform classification. The purpose of the first example is to bring out the generalization capability of the entropy network. The second example is presented to demonstrate and compare the efficacy of the methodology for multicategory problems in a higher dimensional space.

There are three important parameters in the entropy net training. One of these is called the generalization constant α, which determines the generalization capability of the entropy network. The parameter α controls the linear part of the sigmoid nonlinearity. Fig. 4 shows several plots of the sigmoid function for different values of α. Since a very large α value brings the sigmoid nonlinearity very close to the relay nonlinearity, the amount of generalization provided by α is inversely proportional to its value. The second parameter ρ is called the learning factor, which may or may not remain fixed over the entire training. It controls the amount of correction that is applied to determine new weight values. The third parameter ε specifies the termination of the training procedure. The training is terminated if none of the weight components differs from its previous value by an amount greater than ε.

Fig. 5. (a) An analog version of the EX-OR problem in a two-dimensional space. Two different tones represent the two different class regions that a neural network is expected to learn. The dots correspond to input-output examples used for network training. (b) Decision tree for the analog EX-OR problem.

boundary is the decision boundary that the neural net is expected to learn. The entropy network mapping the tree of Fig. 5(b) is shown in Fig. 6. The threshold values of all the inter-
ping examples that were used to train the entropy network. While it is possible to train the ANDing and ORing layers simultaneously, each layer was trained separately to determine the progressive generalization capability of the entropy network. The parameters ρ and ε were set to 1.0 and 0.001, respectively, and different values for the generalization parameter α were used. The learning factor was made to decrease in inverse proportion to the iteration number. Results for two cases of training are shown in Figs. 7 and 8. The upper left image in each of these figures represents

does not mimic the decision tree, but has its own generalization capability.

In order to further test the generalization capability of the network, the class labels of the training examples were changed to correspond to the decision boundary of Fig. 9,

found to depend on the starting random weight values. The average number of iterations for the ANDing layer over the different runs was 193. The corresponding number for the ORing layer was 42. The larger number of iterations for the ANDing layer in this case is due to the great difference between the starting boundary, which corresponds to the tree of Fig. 5(b), and the desired boundary of Fig. 9.

The second example uses a synthetic data set to simulate a well-known waveform recognition problem from the classification tree literature [19]. This was done to compare the classification performance of the entropy net with several other classifiers. The WAVE data consist of 21-dimensional continuous-valued feature vectors coming from three classes with equal a priori probability. Each class of data is generated by combining, with noise, two of the three waveforms of Fig. 12 at 21 sampled positions (see [19] for details).

Fig. 12. Three basic functions to generate waveform data.

The training data set consists of 300 examples that were used to design the tree. The same set of examples was then used to train the entropy net. The test data have 5000 vectors. Using the training vectors, the AMIG algorithm produced the decision tree of Fig. 13 for waveform classification. The initial choice for the weights was made randomly. The training procedure was repeated many times with different initial weight values. No significant differences, either in terms of the number of iterations or the error rate, were noticed due to the initial weight selection. In all of the cases, stable classification performance was attained within 40 iterations. The best classification performance was obtained for an α value of 2.0. Next, a number of classification experiments were performed to gauge the effectiveness of the entropy net. These experiments include entropy net training using the backpropagation program of the PDP software [27]. In this case, a learning rate of 0.5 was found unsuitable in terms of the number of iterations and classification performance. However, using a learning rate of 0.1 resulted in stable performance within 40 iterations. Another entropy net, mapping the decision tree given in [19] for the same problem, was also realized and trained.

Fig. 14 summarizes the classification performance of several classifiers (Entropy CART, Entropy BP, Entropy AMIG, 1-NN, CART Tree, and AMIG Tree).

Fig. 14. Error rate for different classifiers for the waveform recognition problem. Entropy BP represents the results when the entropy net for the AMIG tree was trained using backpropagation.
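The training regime used in the examples above (sigmoid neurons with generalization constant α, a learning factor ρ decaying in inverse proportion to the iteration number, and termination when no weight changes by more than ε) can be sketched as a simple delta-rule loop for one layer. The paper's exact weight-change formula is not legible in this copy, so the update below is a standard sigmoid delta rule used as a stand-in; `train_layer` and its defaults are illustrative assumptions.

```python
import math
import random

def sigmoid(x, alpha=2.0):
    """Soft nonlinearity; alpha is the generalization constant that
    controls the linear part (large alpha approaches a relay function)."""
    return 1.0 / (1.0 + math.exp(-alpha * x))

def train_layer(examples, n_inputs, rho0=1.0, eps=0.001, alpha=2.0,
                max_iter=1000, seed=0):
    """Train one neuron's weights (plus bias) on (input, desired) pairs.
    The learning factor rho decreases as rho0 / iteration; training
    stops once no single weight changes by more than eps."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]  # +1 bias
    for it in range(1, max_iter + 1):
        rho = rho0 / it                    # learning factor ~ 1/iteration
        max_dw = 0.0
        for x, d in examples:
            xa = list(x) + [1.0]           # append constant bias input
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, xa)), alpha)
            grad = (d - y) * alpha * y * (1.0 - y)   # sigmoid derivative
            for i in range(len(w)):
                dw = rho * grad * xa[i]
                w[i] += dw
                max_dw = max(max_dw, abs(dw))
        if max_dw < eps:                   # epsilon termination test
            break
    return w

# Toy usage: a single ANDing-style neuron learning a logical AND
# of two (already thresholded) partitioning-layer outputs.
w = train_layer([([0, 0], 0.0), ([0, 1], 0.0),
                 ([1, 0], 0.0), ([1, 1], 1.0)], n_inputs=2)
```

After training, the neuron's response to (1, 1) exceeds its response to the other input combinations, which is the desired ANDing behavior; with the built-in tree-derived starting point described earlier, such per-layer training converges in far fewer presentations than training a fully connected net from scratch.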
the tree-to-layered-network mapping rules provide a systematic tool to obtain a layered network architecture. One very important property of the entropy network architecture is that the problem of credit assignment does not exist for these networks, as it is automatically solved during the tree learning stage. The issues of incremental learning and generalization have been discussed, and the importance of soft nonlinearities has been stressed. Finally, a two-stage procedure has been given where the dominant aspects of the problem are learned during the tree development phase, and the generalization of the learned knowledge takes place via the entropy network training. The effectiveness of the proposed methodology has been shown through two examples.

It needs to be mentioned that the tree-to-network mapping approach is not without flaws. Because of the use of a single feature at each node during the tree design, it is possible in many cases, the EX-OR problem for one, to end up with very large trees. One possible solution to avoid very large trees is to apply the Hotelling or principal component transformation [28] to the data first. In terms of the entropy network, this is equivalent to adding an extra representation stage, giving rise to a network which can be appropriately called a Hotelling-entropy net. Such networks are currently under study, along with a study on the limitations of partially connected networks, in comparison to fully connected feedforward networks, with respect to missing data and broken connections.

ACKNOWLEDGMENT

I gratefully acknowledge the assistance of N. Ramesh, M. Otten, and G. Yu in running some experiments for me. I also thank Prof. A. Jain for many useful discussions.

REFERENCES

[1] D. J. Burr, "Experiments on neural net recognition of spoken and written text," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 1162-1168, July 1988.
[2] R. P. Lippmann, "An introduction to computing with neural nets," IEEE ASSP Mag., pp. 4-22, Apr. 1987.
[3] A. Wieland and R. Leighton, "Geometric analysis of neural network capabilities," in Proc. IEEE Int. Conf. Neural Networks, Vol. III, San Diego, CA, June 1987, pp. 385-392.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in D. E. Rumelhart and J. L. McClelland, Eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. Cambridge, MA: M.I.T. Press, 1986.
[5] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive Sci., vol. 9, pp. 147-169, 1985.
[6] R. Hecht-Nielsen, "Counterpropagation networks," Appl. Opt., vol. 26, pp. 4979-4984, Dec. 1987.
[7] B. Widrow, R. G. Winter, and R. A. Baxter, "Layered neural nets for pattern recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 1109-1118, July 1988.
[8] W. Y. Huang and R. P. Lippmann, "Comparison between neural net and conventional classifiers," in Proc. IEEE 1st Int. Conf. Neural Networks, Vol. IV, San Diego, CA, June 1987, pp. 485-493.
[9] L. N. Kanal, "Problem-solving models and search strategies for pattern recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-1, pp. 194-201, Apr. 1979.
[10] H. J. Payne and W. S. Meisel, "An algorithm for constructing optimal binary decision trees," IEEE Trans. Comput., vol. C-25, pp. 905-916, Sept. 1977.
[11] L. A. Akers, M. R. Walker, D. K. Ferry, and R. O. Grondin, "Limited interconnectivity in synthetic neural systems," in R. Eckmiller and C. v.d. Malsburg, Eds., Neural Computers. New York: Springer-Verlag, 1988.
[12] C. R. P. Hartmann, P. K. Varshney, K. G. Mehrotra, and C. L. Gerberich, "Application of information theory to the construction of efficient decision trees," IEEE Trans. Inform. Theory, vol. IT-28, pp. 565-577, July 1982.
[13] I. K. Sethi and B. Chatterjee, "Efficient decision tree design for discrete variable pattern recognition problems," Pattern Recognition, vol. 9, pp. 197-206, 1978.
[14] I. K. Sethi and B. Chatterjee, "A learning classifier scheme for discrete variable pattern recognition problems," IEEE Trans. Syst., Man, Cybern., vol. SMC-8, pp. 49-52, Jan. 1978.
[15] J. C. Stoffel, "A classifier design technique for discrete variable pattern recognition problems," IEEE Trans. Comput., vol. C-23, pp. 428-441, 1974.
[16] E. G. Henrichon and K. S. Fu, "A nonparametric partitioning procedure for pattern classification," IEEE Trans. Comput., vol. C-18, pp. 614-624, July 1969.
[17] I. K. Sethi and G. P. R. Sarvarayudu, "Hierarchical classifier design using mutual information," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-4, pp. 441-445, July 1982.
[18] J. L. Talmon, "A multiclass nonparametric partitioning algorithm," in E. S. Gelsema and L. N. Kanal, Eds., Pattern Recognition in Practice II. Amsterdam: Elsevier Science Pub. B.V. (North-Holland), 1986.
[19] L. Breiman, J. Friedman, R. Olshen, and C. J. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth Int. Group, 1984.
[20] L. Hyafil and R. L. Rivest, "Constructing optimal binary decision trees is NP-complete," Inform. Processing Lett., vol. 5, pp. 15-17, 1976.
[21] R. M. Fano, Transmission of Information. New York: Wiley, 1963.
[22] R. M. Goodman and P. Smyth, "Decision tree design from a communication theory standpoint," IEEE Trans. Inform. Theory, vol. 34, pp. 979-994, Sept. 1988.
[23] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[24] T. Kohonen, Self-Organization and Associative Memory. Berlin: Springer-Verlag, 1984.
[25] C. J. Matheus and W. E. Hohensee, "Learning in artificial neural systems," Univ. of Illinois, Urbana, Tech. Rep. TR-87-1394, 1987.
[26] B. Widrow and M. E. Hoff, "Adaptive switching circuits," in 1960 IRE WESCON Conv. Rec., part 4, 1960, pp. 96-104.
[27] J. L. McClelland and D. E. Rumelhart, Explorations in Parallel Distributed Processing. Cambridge, MA: M.I.T. Press, 1988.
[28] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.

Ishwar K. Sethi (Senior Member, IEEE) received the B.Tech. (Hons.), M.Tech., and Ph.D. degrees in electronics and electrical communication engineering from the Indian Institute of Technology, Kharagpur, India, in 1969, 1971, and 1977, respectively.

He is presently an Associate Professor of Computer Science at Wayne State University, Detroit, Michigan. His current research interests are in the applications of artificial neural networks to solve computer vision and pattern recognition problems. Prior to joining Wayne State University, he was on the faculty at the Indian Institute of Technology, Kharagpur, India. During the latter half of 1988, he was a Visiting Professor at the Indian Institute of Technology, New Delhi, India.

Dr. Sethi is a member of the Association for Computing Machinery, the International Neural Network Society, and the International Society for Optical Engineering. He is currently an Associate Editor of Pattern Recognition.