
Neural Networks are Decision Trees

Caglar Aytekin
AI Lead
AAC Technologies
caglaraytekin@aactechnologies.com, cagosmail@gmail.com
arXiv:2210.05189v1 [cs.LG] 11 Oct 2022

Abstract

In this manuscript, we show that any neural network with piece-wise linear activation functions can be represented as a decision tree. The representation is an equivalence, not an approximation, thus keeping the accuracy of the neural network exactly as is. This equivalence shows that neural networks are indeed interpretable by design and makes the black-box understanding obsolete. We share equivalent trees of some neural networks and show that, besides providing interpretability, the tree representation can also achieve some computational advantages. The analysis holds both for fully connected and convolutional networks, which may or may not also include skip connections and/or normalizations.

1. Introduction

Despite the immense success of neural networks over the past decade, the black-box nature of their predictions prevents their wider and more reliable adoption in many industries, such as health and security. This fact has led researchers to investigate ways to explain neural network decisions. The efforts in explaining neural network decisions can be categorized into several approaches: saliency maps, approximation by interpretable methods, and joint models.

Saliency maps highlight the areas of the input that a neural network makes the most use of while predicting. An earlier work [17] takes the gradient of the neural network output with respect to the input in order to visualize an input-specific linearization of the entire network. Another work [22] uses a deconvnet to go back to features from decisions. The saliency maps obtained via these methods are often noisy and prevent a clear understanding of the decisions made. Another track of methods [24], [15], [2], [4] makes use of the derivative of a neural network output with respect to an activation, usually the one right before the fully connected layers. The saliency maps obtained by this track, and by some other works [23], [9], [3], are clearer in the sense of highlighting areas related to the predicted class. Although useful for purposes such as checking whether the support area for a decision is sound, these methods still lack a detailed logical reasoning of why such a decision is made.

Conversion between neural networks and interpretable-by-design models -such as decision trees- has been a topic of interest. In [6], a method was devised to initialize neural networks with decision trees. [7, 16, 21] also provide neural network equivalents of decision trees. The neural networks in these works have specific architectures, so the conversion does not generalize to arbitrary models. In [20], neural networks were trained in such a way that their decision boundaries can be approximated by trees. This work does not provide a correspondence between neural networks and decision trees, and merely uses the latter as a regularization. In [5], a neural network was used to train a decision tree. Such tree distillation is an approximation of a neural network and not a direct conversion, and thus performs poorly on the tasks that the neural network was trained on.

Joint neural network and decision tree models [10], [13], [11], [12], [14], [1], [8], [18] generally use deep learning to assist some trees, or devise a neural network structure so that it resembles a tree. A recent work [19] replaces the final fully connected layer of a neural network with a decision tree. Since the backbone features are still those of a neural network, the explanation is sought by providing humans a means to validate the decision as a good or bad one, rather than a complete logical reasoning of the decision.

In this paper, we show that any neural network with piece-wise linear activations has a directly equivalent decision tree representation. The induced tree output is thus exactly the same as that of the neural network, and the tree representation does not limit or require altering the neural architecture in any way. Via the decision tree, we show that neural networks are indeed white boxes that are directly interpretable, and that it is possible to explain every decision made within the neural network. We show that the decision tree equivalent of a neural network can be found for both fully connected and convolutional neural networks, which may include skip connections and normalizations as well. Besides the interpretability aspect, we show that the induced tree is also advantageous to the corresponding neural network in terms of computational complexity, at the expense of increased storage memory.

2. Decision Tree Analysis of Neural Networks

2.1. Fully Connected Networks

Let W_i be the weight matrix of a network's i-th layer. Let σ be any piece-wise linear activation function, and x_0 be the input to the neural network. Then, the output and an intermediate feature of a feed-forward neural network can be represented as in Eq. 1.

NN(x_0) = W_{n−1}^T σ(W_{n−2}^T σ(... W_1^T σ(W_0^T x_0)))
x_i = σ(W_{i−1}^T σ(... W_1^T σ(W_0^T x_0)))    (1)

Note that in Eq. 1, we omit any final activation (e.g. softmax) and we ignore the bias term, as it can simply be included by concatenating a 1 value to each x_i. The activation function σ acts as an element-wise scalar multiplication, hence the following can be written.

W_i^T σ(W_{i−1}^T x_{i−1}) = W_i^T (a_{i−1} ⊙ (W_{i−1}^T x_{i−1}))    (2)

In Eq. 2, a_{i−1} is a vector indicating the slopes of the activations in the corresponding linear regions that W_{i−1}^T x_{i−1} falls into, and ⊙ denotes element-wise multiplication. Note that a_{i−1} can directly be interpreted as a categorization result, since it includes indicators (slopes) of the linear regions of the activation function. Eq. 2 can be re-organized as follows.

W_i^T σ(W_{i−1}^T x_{i−1}) = (W_i ⊙ a_{i−1})^T W_{i−1}^T x_{i−1}    (3)

In Eq. 3, we use ⊙ as a column-wise element-wise multiplication on W_i. This corresponds to element-wise multiplication by a matrix obtained by repeating the a_{i−1} column vector to match the size of W_i. Using Eq. 3, Eq. 1 can be rewritten as follows.

NN(x_0) = (W_{n−1} ⊙ a_{n−2})^T (W_{n−2} ⊙ a_{n−3})^T ... (W_1 ⊙ a_0)^T W_0^T x_0    (4)

From Eq. 4, one can define an effective weight matrix c_{i−1}Ŵ_i^T of a layer i, to be applied directly on the input x_0, as follows:

c_{i−1}Ŵ_i^T = (W_i ⊙ a_{i−1})^T ... (W_1 ⊙ a_0)^T W_0^T
c_{i−1}Ŵ_i^T x_0 = W_i^T x_i    (5)

In Eq. 5, the categorization vector until layer i is defined as follows: c_{i−1} = a_0 ∥ a_1 ∥ ... ∥ a_{i−1}, where ∥ is the concatenation operator.

One can directly observe from Eq. 5 that the effective matrix of layer i depends only on the categorization vectors from previous layers. This indicates that in each layer, a new effective filter is selected -to be applied on the network input- based on the previous categorizations/decisions. This directly shows that a fully connected neural network can be represented as a single decision tree, where effective matrices act as categorization rules. In each layer i, the response of the effective matrix c_{i−1}Ŵ_i^T is categorized into the vector a_i, and based on this categorization result, the next layer's effective matrix c_iŴ_{i+1}^T is determined. A layer i is thus represented as a k^{m_i}-way categorization, where m_i is the number of filters in layer i and k is the total number of linear regions in an activation. This categorization in a layer i can thus be represented by a tree of depth m_i, where a node at any depth is branched into k categorizations.

In order to better illustrate the equivalent decision tree of a neural network, in Algorithm 1 we rewrite Eq. 5 for the entire network, as an algorithm. For the sake of simplicity and without loss of generality, we provide the algorithm with the ReLU activation function, where a ∈ {0, 1}. It can be clearly observed that lines 5−9 in Algorithm 1 correspond to a node in the decision tree, where a simple yes/no decision is made.

The decision tree equivalent of a neural network can thus be constructed as in Algorithm 2. Using this algorithm, we share a tree representation obtained for a neural network with three layers, having 2, 1 and 1 filters for layers 1, 2 and 3 respectively. The network has ReLU activations in between layers, and no activation after the last layer. It can be observed from Algorithm 2 and Fig. 1 that the depth of a NN-equivalent tree is d = Σ_{i=0}^{n−2} m_i, and the total number of categories in the last branch is 2^d. At first glance, the number of categories seems huge. For example, if the first layer of a neural network contains 64 filters, there would exist at least 2^64 branches in the tree, which is already intractable. However, there may occur violating and redundant rules that allow lossless pruning of the NN-equivalent tree. Another observation is that it is highly likely that not all categories will be realized during training, due to the possibly larger number of categories than training data. These categories can be pruned as well based on the application, and the data falling into them can be considered invalid, if the application permits. In the next section, we show that such redundant, violating and unrealized categories indeed exist, by analysing the decision trees of some neural networks. But before that, we show that the tree equivalent of a neural network exists for skip connections, normalizations and convolutions as well.
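The equivalence of Eqs. 1-5 is easy to verify numerically. The sketch below is illustrative only (a small random NumPy network standing in for an arbitrary ReLU net; it is not code from this manuscript): it records the ReLU slope vector a_i at each layer for a given input, accumulates the effective matrix of Eq. 5, and checks that applying it directly to the input reproduces the network output exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small fully connected ReLU network of Eq. 1: weights W_i, biases omitted.
dims = [3, 4, 4, 2]  # input dim, two hidden widths, output dim
W = [rng.standard_normal((dims[i], dims[i + 1])) for i in range(len(dims) - 1)]

def forward(x0):
    """Plain forward pass NN(x0) of Eq. 1 (ReLU, no final activation)."""
    x = x0
    for Wi in W[:-1]:
        x = np.maximum(Wi.T @ x, 0.0)
    return W[-1].T @ x

def effective_matrix(x0):
    """Effective weight matrix of Eq. 5 for the activation pattern of x0."""
    W_hat = W[0]  # start with W_0, as in Algorithm 1
    for W_next in W[1:]:
        a = (W_hat.T @ x0 > 0).astype(float)   # ReLU slopes a_i ∈ {0, 1}
        W_hat = W_hat @ (W_next * a[:, None])  # accumulate (W_{i+1} ⊙ a_i)
    return W_hat

x0 = rng.standard_normal(3)
# The tree leaf selected by x0 applies W_hat^T directly to the input, and the
# result matches the network output exactly (up to float rounding).
assert np.allclose(forward(x0), effective_matrix(x0).T @ x0)
```

The slope vectors computed inside the loop are exactly the yes/no categorizations that label a path from root to leaf in the equivalent tree.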
Algorithm 1: Algorithm of Eq. 5 for ReLU networks
 1: Ŵ = W_0
 2: for i = 0 to n − 2 do
 3:   a = []
 4:   for j = 0 to m_i − 1 do
 5:     if Ŵ_j^T x_0 > 0 then
 6:       a.append(1)
 7:     else
 8:       a.append(0)
 9:     end
10:   end
11:   Ŵ = Ŵ (W_{i+1} ⊙ a)
12: end
13: return Ŵ^T x_0

Algorithm 2: Algorithm of converting neural networks to decision trees
 1: Initialize tree: set root.
 2: Branch all leaves into k nodes; the decision rule is the first effective filter.
 3: Branch all nodes into k more nodes, and repeat until all effective filters in the layer are covered.
 4: Calculate the effective matrix for each leaf via Eq. 5; repeat steps 2 and 3.
 5: Repeat step 4 until all layers are covered.
 6: return Tree

2.1.1 Skip Connections

We analyse a residual neural network of the following type:

r_x_0 = W_0^T x_0
r_x_i = r_x_{i−1} + W_i^T σ(r_x_{i−1})    (6)

Using Eq. 6, via an analysis similar to that in Sec. 2.1, one can rewrite r_x_i as follows.

r_x_i = a_{i−1}Ŵ_i^T r_x_{i−1}
a_{i−1}Ŵ_i^T = I + (W_i ⊙ a_{i−1})^T    (7)

Finally, using a_{i−1}Ŵ_i^T in Eq. 7, one can define effective matrices for residual neural networks as follows.

r_x_i = r_Ŵ_i^T x_0
r_Ŵ_i^T = a_{i−1}Ŵ_i^T a_{i−2}Ŵ_{i−1}^T ... a_0Ŵ_1^T W_0^T    (8)

One can observe from Eq. 8 that, for layer i, the residual effective matrix r_Ŵ_i^T is defined solely based on the categorizations from the previous activations. Similar to the analysis in Sec. 2.1, this enables a tree equivalent of residual neural networks.

2.1.2 Normalization Layers

A separate analysis is not needed for any normalization layer, as popular normalization layers are linear; after training, they can be embedded into the linear layer that comes after or before them, in pre-activation or post-activation normalizations respectively.

2.2. Convolutional Neural Networks

Let K_i : C_{i+1} × C_i × M_i × N_i be the convolution kernel for layer i, applied to an input F_i : C_i × H_i × W_i. One can write the output of a convolutional neural network CNN(F_0), and an intermediate feature F_i, as follows.

CNN(F_0) = K_{n−1} ∗ σ(K_{n−2} ∗ σ(... σ(K_0 ∗ F_0)))
F_i = σ(K_{i−1} ∗ σ(... σ(K_0 ∗ F_0)))    (9)

Similar to the fully connected network analysis, one can write the following, due to the element-wise scalar multiplication nature of the activation function.

K_i ∗ σ(K_{i−1} ∗ F_{i−1}) = (K_i ⊙ a_{i−1}) ∗ (K_{i−1} ∗ F_{i−1})    (10)

In Eq. 10, a_{i−1} is of the same spatial size as K_i and consists of the slopes of the activation function in the corresponding regions of the previous feature F_{i−1}. Note that the above only holds for a specific spatial region, and there exists a separate a_{i−1} for each spatial region that the convolution K_{i−1} is applied to. For example, if K_{i−1} is a 3 × 3 kernel, there exists a separate a_{i−1} for every 3 × 3 region that the convolution is applied to. An effective convolution c_{i−1}K̂_i can be written as follows.

c_{i−1}K̂_i = (K_i ⊙ a_{i−1}) ∗ ... ∗ (K_1 ⊙ a_0) ∗ K_0
c_{i−1}K̂_i ∗ x_0 = K_i ∗ x_i    (11)

Note that in Eq. 11, c_{i−1}K̂_i contains specific effective convolutions per region, where a region is defined according to the receptive field of layer i. Here, c is defined as the concatenated categorization results of all relevant regions from previous layers.

One can observe from Eq. 11 that effective convolutions are only dependent on the categorizations coming from activations, which enables the tree equivalence -similar to the analysis for fully connected networks. A difference from the fully connected case is that many decisions are made on partial input regions rather than on the entire x_0.
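The residual case of Eqs. 6-8 can be checked the same way as the fully connected case. The following sketch is our own illustration (square random weights are assumed so the skip addition is well-defined; this is not the manuscript's code): it builds the product of (I + (W_i ⊙ a_{i−1})^T) factors from Eq. 7 and confirms it equals the residual forward pass.

```python
import numpy as np

rng = np.random.default_rng(1)

# Residual ReLU network of Eq. 6: r_x0 = W0^T x0, r_xi = r_x{i-1} + Wi^T σ(r_x{i-1}).
d = 4
W = [rng.standard_normal((d, d)) for _ in range(3)]  # W0, W1, W2

def forward(x0):
    """Residual forward pass of Eq. 6."""
    x = W[0].T @ x0
    for Wi in W[1:]:
        x = x + Wi.T @ np.maximum(x, 0.0)  # skip connection + ReLU branch
    return x

def residual_effective_matrix(x0):
    """r_Ŵi^T of Eq. 8: a product of (I + (Wi ⊙ a_{i-1})^T) factors times W0^T."""
    M = W[0].T  # r_x0 = W0^T x0
    for Wi in W[1:]:
        a = (M @ x0 > 0).astype(float)  # ReLU slopes at r_x_{i-1}
        M = (np.eye(d) + Wi.T * a) @ M  # Eq. 7: I + (Wi ⊙ a_{i-1})^T
    return M

x0 = rng.standard_normal(d)
# One effective matrix per activation pattern, applied directly to the input.
assert np.allclose(forward(x0), residual_effective_matrix(x0) @ x0)
```

As in the fully connected case, the slope vectors a depend only on earlier decisions, which is what makes a tree representation possible for residual networks.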
[Figure: full binary tree. The root tests W_{00}^T x_0 > 0, the second level tests W_{01}^T x_0 > 0, the third level tests effective-filter rules of the form cŴ_{10}^T x_0 > 0, and each of the 8 leaves applies an effective matrix cŴ_{20}^T x_0, where the left subscript denotes the categorization path (000 through 111).]

Figure 1. Decision Tree for a 2-layer ReLU Neural Network

[Figure: depth-4 tree on the scalar input x. Internal nodes test interval rules such as x < −1.16, x < 0.32, x > 1, x > 0.54, x > 0.52, x > 0.39, x < −0.1, x < 0.11, x < −0.7 and x < −0.38; each of the 16 leaves applies an effective linear function kŴ^T x, k = 0, ..., 15.]

Figure 2. Decision Tree for a y = x^2 Regression Neural Network

3. Experimental Results

First, we make a toy experiment where we fit a neural network to the equation y = x^2. The neural network has 3 dense layers with 2 filters each, except for the last layer, which has 1 filter. The network uses leaky-ReLU activations after the fully connected layers, except for the last layer, which has no post-activation. Fig. 2 shows the decision tree corresponding to the neural network. In the tree, every black rectangular box indicates a rule; the left child of the box means the rule does not hold, and the right child means the rule holds. For better visualization, the rules are obtained by converting w^T x + β > 0 into direct inequalities acting on x. This can be done for this particular regression, since x is a scalar. In every leaf, the network applies a linear function -indicated by a red rectangle- based on the decisions so far. We have avoided writing these functions explicitly due to limited space. At first glance, the tree representation of a neural network in this example seems large due to the 2^{Σ_i m_i} = 2^4 = 16 categorizations. However, we notice that many of the rules in the decision tree are redundant, and hence some paths in the decision tree become invalid. An example of a redundant rule is checking x < 0.32 after the rule x < −1.16 already holds. This directly creates an invalid left child for this node. Hence, the tree can be cleaned by removing the left child in this case and merging the categorization rule into the stricter one: x < −1.16 in this particular case. By cleaning the decision tree in Fig. 2, we obtain the simpler tree in Fig. 3a, which consists of only 5 categories instead of 16. The 5 categories are also directly visible from the model response in Fig. 3b. The interpretation of the neural network is thus straightforward: for each region whose boundaries are determined via the decision tree representation, the network approximates the non-linear equation y = x^2 by a linear equation. One can clearly interpret, and moreover make deductions from, the decision tree, some of which are as follows. The neural network is unable to grasp the symmetrical nature of the regression problem, which is evident from the asymmetrical decision boundaries. The regions below −1.16 and above 1 are unbounded, and thus neural decisions lose accuracy as x goes beyond these boundaries.

Next, we investigate another toy problem of classifying half-moons and analyse the decision tree produced by a neural network. We train a fully connected neural network with 3 layers, with leaky-ReLU activations except for the last layer, which has a sigmoid activation. Each layer has 2 filters except for the last layer, which has 1. The cleaned decision tree induced by the trained network is shown in Fig. 4. The decision tree finds many categories whose boundaries are
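For a scalar input, the redundancy check described above can be mechanized: each path through the tree is a conjunction of interval constraints on x, and a branch can be pruned exactly when its feasible region is empty. A minimal sketch of that check, using the thresholds discussed above:

```python
# Each path in the 1-D tree is a conjunction of (op, threshold) rules on x.
# A rule like "x < 0.32 tested after x < -1.16 holds" is redundant: its "yes"
# branch adds nothing, and its "no" branch describes an empty region.

def feasible(constraints):
    """Check whether a conjunction of scalar interval rules is satisfiable."""
    lo, hi = float("-inf"), float("inf")
    for op, t in constraints:
        if op == "<":
            hi = min(hi, t)  # x < t tightens the upper bound
        else:                # ">"
            lo = max(lo, t)  # x > t tightens the lower bound
    return lo < hi

# Path where x < -1.16 holds and the tree then tests x < 0.32:
assert feasible([("<", -1.16), ("<", 0.32)])      # "yes" branch: feasible but redundant
assert not feasible([("<", -1.16), (">", 0.32)])  # "no" branch: empty region -> prune
```

This is exactly the lossless cleaning that takes the 16-leaf tree of Fig. 2 down to 5 categories.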
[Figure: (a) the cleaned tree. The root rule is x < −1.16, with leaf −3.67x − 3.22 when it holds; the remaining rules x < 0.32, x > 1 and x < 0.11 lead to the leaves 0.55x + 0.09, 3.47x − 2.83, 2.37x − 0.50 and −1.00x − 0.12. (b) Plot of the resulting piece-wise linear approximation of y = x^2.]

Figure 3. Cleaned Decision Tree for a y = x^2 Regression Neural Network

[Figure: classification tree on the 2-D input (x, y). Internal nodes test linear rules such as −0.98x − 0.49y + 0.95, −1.6x + 0.2y − 0.07, −0.15x + 0.19y + 0.41, 0.5x + 0.52y − 0.22, −0.5x + 0.64y − 0.27, 1.51x − 1y + 1.03, 1.6x − 1.35y − 1.44, −2.72x − 3.53y + 2.77, 4.18x − 3.17y + 2.54 and 5.31x − 4.51y + 3.14; the leaves output class 0 or 1.]

Figure 4. Classification Tree for a Half-Moon Classification Neural Network

        |          y = x^2           |          Half-Moon
        | Param.  Comp.  Mult./Add.  | Param.  Comp.  Mult./Add.
  Tree  |   14     2.6       2       |   39     4.1      8.2
  NN    |   13     4        16       |   15     5       25

Table 1. Computation and memory analysis of toy problems.
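The tree's "Comp." entries are expected values, because different categories sit at different depths. A minimal sketch of that bookkeeping (the depths and leaf frequencies below are hypothetical, for illustration only, not the measured values behind Table 1):

```python
# Expected number of float comparisons for a decision tree: a weighted average
# over leaves, since an input landing in a shallow leaf needs fewer tests.
# The depths and probabilities here are made up for illustration.

def expected_comparisons(leaves):
    """leaves: list of (depth, probability that an input lands in that leaf)."""
    return sum(depth * p for depth, p in leaves)

# A 5-leaf cleaned tree where the shallow leaf is hit most often:
leaves = [(1, 0.4), (3, 0.2), (3, 0.2), (4, 0.1), (4, 0.1)]
print(expected_comparisons(leaves))  # 2.4 comparisons on average
```

The neural network, by contrast, always performs the same number of operations regardless of which region the input falls into.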

[Figure: the half-moon data points overlaid with the colored input regions (categories) induced by the decision tree.]

Figure 5. Categorizations made by the decision tree for the half-moon dataset

determined by the rules in the tree, where each category is assigned a single class. In order to better visualize the categories, we illustrate them with different colors in Fig. 5. One can make several deductions from the decision tree, such as the following. Some regions are very well-defined and bounded, and the classifications they make are perfectly in line with the training data; these regions are very reliable. There are unbounded categories which help obtain accurate classification boundaries, yet fail to provide a compact representation of the training data; these may correspond to inaccurate extrapolations made by neural decisions. There are also some categories that emerged although none of the training data falls into them.

Besides the interpretability aspect, the decision tree representation also provides some computational advantages. In Table 1, we compare the number of parameters, floating-point comparisons, and multiplication or addition operations of the neural network and the tree induced by it. Note that
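The "unrealized categories" observation can be reproduced in a few lines: pass a dataset through a leaky-ReLU network, record each sample's activation pattern (i.e. its tree path), and compare the number of distinct patterns with the 2^d possible ones. The sketch below is our own illustration, with a random network standing in for the trained half-moon classifier:

```python
import numpy as np

rng = np.random.default_rng(2)

# Random 2-layer net on 2-D inputs: 2 + 2 = 4 binary decisions -> 2^4 = 16 categories.
W = [rng.standard_normal((2, 2)), rng.standard_normal((2, 2))]

def pattern(x):
    """Activation pattern (the tree path) of one input: one sign bit per filter."""
    bits, h = [], x
    for Wi in W:
        pre = Wi.T @ h
        bits += [int(v > 0) for v in pre]
        h = np.where(pre > 0, pre, 0.01 * pre)  # leaky ReLU
    return tuple(bits)

X = rng.standard_normal((1000, 2))  # stand-in for the half-moon data
realized = {pattern(x) for x in X}
print(len(realized), "of 16 categories realized")  # typically fewer than 16
```

Categories that no data reaches correspond to tree leaves that can be pruned, or flagged as invalid, depending on what the application permits.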
the comparisons, multiplications and additions in the tree representation are given as expected values, since the depth of the tree differs per category. As the induced tree is an unfolding of the neural network, it covers all possible routes and keeps all possible effective filters in memory. Thus, as expected, the number of parameters in the tree representation of a neural network is larger than that of the network. In the induced tree, in every layer i, at most m_i filters are applied directly on the input, whereas in the neural network, m_i filters are always applied on the previous feature, which is usually much larger than the input in the feature dimension. Thus, computation-wise, the tree representation is advantageous compared to the neural network.

4. Conclusion

In this manuscript, we have shown that a neural network with piece-wise linear activation functions can be equivalently represented as a decision tree. The tree equivalence holds for fully connected layers, convolutional layers, residual connections and normalizations. This tree representation provides a direct way to clearly interpret neural networks and makes the long-standing black-box understanding of neural networks obsolete.

References

[1] Karim Ahmed, Mohammad Haris Baig, and Lorenzo Torresani. Network of experts for large-scale image categorization. In European Conference on Computer Vision, pages 516–532. Springer, 2016.
[2] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847. IEEE, 2018.
[3] Edo Collins, Radhakrishna Achanta, and Sabine Susstrunk. Deep feature factorization for concept discovery. In Proceedings of the European Conference on Computer Vision (ECCV), pages 336–352, 2018.
[4] Rachel Lea Draelos and Lawrence Carin. Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks. arXiv preprint, 2020.
[5] Nicholas Frosst and Geoffrey Hinton. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.
[6] Kelli D Humbird, J Luc Peterson, and Ryan G McClarren. Deep neural network initialization with decision trees. IEEE Transactions on Neural Networks and Learning Systems, 30(5):1286–1295, 2018.
[7] Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulo. Deep neural decision forests. In Proceedings of the IEEE International Conference on Computer Vision, pages 1467–1475, 2015.
[8] Mason McGill and Pietro Perona. Deciding how to decide: Dynamic routing in artificial neural networks. In International Conference on Machine Learning, pages 2363–2372. PMLR, 2017.
[9] Mohammed Bany Muhammad and Mohammed Yeasin. Eigen-CAM: Class activation map using principal components. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2020.
[10] Ravi Teja Mullapudi, William R Mark, Noam Shazeer, and Kayvon Fatahalian. HydraNets: Specialized dynamic architectures for efficient inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8080–8089, 2018.
[11] Calvin Murdock, Zhen Li, Howard Zhou, and Tom Duerig. Blockout: Dynamic model selection for hierarchical deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2583–2591, 2016.
[12] Venkatesh N Murthy, Vivek Singh, Terrence Chen, R Manmatha, and Dorin Comaniciu. Deep decision network for multi-class image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2240–2248, 2016.
[13] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.
[14] Anirban Roy and Sinisa Todorovic. Monocular depth estimation using neural regression forest. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5506–5514, 2016.
[15] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[16] Ishwar Krishnan Sethi. Entropy nets: From decision trees to neural networks. Proceedings of the IEEE, 78(10):1605–1613, 1990.
[17] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[18] Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–18, 2018.
[19] Alvin Wan, Lisa Dunlap, Daniel Ho, Jihan Yin, Scott Lee, Henry Jin, Suzanne Petryk, Sarah Adel Bargal, and Joseph E Gonzalez. NBDT: Neural-backed decision trees. arXiv preprint arXiv:2004.00221, 2020.
[20] Mike Wu, Michael Hughes, Sonali Parbhoo, Maurizio Zazzi, Volker Roth, and Finale Doshi-Velez. Beyond sparsity: Tree regularization of deep models for interpretability. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[21] Yongxin Yang, Irene Garcia Morillo, and Timothy M Hospedales. Deep neural decision trees. arXiv preprint arXiv:1806.06988, 2018.
[22] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
[23] Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10):1084–1102, 2018.
[24] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
