Neural Networks Are Decision Trees
Caglar Aytekin
AI Lead
AAC Technologies
caglaraytekin@aactechnologies.com, cagosmail@gmail.com
arXiv:2210.05189v1 [cs.LG] 11 Oct 2022
We analyse a residual neural network of the following type:

${}_r x_0 = W_0^T x_0$
${}_r x_i = {}_r x_{i-1} + W_i^T \sigma({}_r x_{i-1})$   (6)

Using Eq. 6, via a similar analysis in Sec. 2.1, one can rewrite ${}_r x_i$ as follows.

${}_r x_i = {}_{a_{i-1}}\hat{W}_i^T \, {}_r x_{i-1}$
${}_{a_{i-1}}\hat{W}_i^T = I + (W_i a_{i-1})^T$   (7)
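The step from Eq. 6 to Eq. 7 can be made explicit for a piecewise-linear activation such as leaky-ReLU, which acts as an element-wise multiplication by a slope vector $a_{i-1}$. Here we read the paper's shorthand $W_i a_{i-1}$ as the row-wise scaling $\operatorname{diag}(a_{i-1}) W_i$, which is an interpretation on our part:

$\sigma({}_r x_{i-1}) = a_{i-1} \odot {}_r x_{i-1}$, hence
${}_r x_i = {}_r x_{i-1} + W_i^T \operatorname{diag}(a_{i-1})\, {}_r x_{i-1} = \left(I + (\operatorname{diag}(a_{i-1}) W_i)^T\right) {}_r x_{i-1}$,

where each entry of $a_{i-1}$ is the slope (1 or the leaky slope) of the linear piece that its unit currently falls into.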
Finally, using ${}_{a_{i-1}}\hat{W}_i^T$ in Eq. 7, one can define effective matrices for residual neural networks as follows.

${}_r x_i = {}_r \hat{W}_i^T x_0$
${}_r \hat{W}_i^T = {}_{a_{i-1}}\hat{W}_i^T \, {}_{a_{i-2}}\hat{W}_{i-1}^T \cdots {}_{a_0}\hat{W}_1^T \, W_0^T$   (8)

One can observe from Eq. 8 that, for layer $i$, the residual effective matrix ${}_r \hat{W}_i^T$ is defined solely by the categorizations from the previous activations. Similar to the analysis in Sec. 2.1, this enables a tree equivalent of residual neural networks.
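As a sanity check of Eq. 8, the sketch below builds the effective matrix ${}_r \hat{W}_i^T$ layer by layer for a small residual network with leaky-ReLU and confirms that it reproduces the ordinary forward pass. The 0.1 negative slope, the dimensions, and the random weights are illustrative assumptions, not values from the paper.

```python
import numpy as np

ALPHA = 0.1  # assumed leaky-ReLU slope on the negative side

def leaky_relu(z):
    return np.where(z > 0, z, ALPHA * z)

def slopes(z):
    # a_i: the linear-region slope of each unit (1 or ALPHA)
    return np.where(z > 0, 1.0, ALPHA)

rng = np.random.default_rng(0)
d = 4
W = [rng.normal(size=(d, d)) for _ in range(4)]  # illustrative W_0 ... W_3

x0 = rng.normal(size=d)
xr = W[0].T @ x0                 # {}_r x_0 = W_0^T x_0              (Eq. 6)
eff = W[0].T                     # running effective matrix {}_r W_i^T (Eq. 8)
for Wi in W[1:]:
    a = slopes(xr)                          # categorization from the current activations
    eff = (np.eye(d) + Wi.T * a) @ eff      # I + (W_i a)^T = I + W_i^T diag(a)   (Eq. 7)
    xr = xr + Wi.T @ leaky_relu(xr)         # ordinary residual update            (Eq. 6)

print(np.allclose(xr, eff @ x0))  # True: the residual stack is one input-dependent linear map
```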
2.1.2 Normalization Layers

A separate analysis is not needed for any normalization layer: popular normalization layers are linear, and after training they can be embedded into the linear layer that comes after or before them, for pre-activation or post-activation normalization respectively.
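As an illustration of this folding, the sketch below absorbs a frozen batch-norm that sits between a linear layer and its activation into the preceding linear layer's weights and bias. The shapes and the random statistics are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# a linear layer (4 inputs, 3 outputs) followed by a frozen batch-norm, before the activation
W, b = rng.normal(size=(4, 3)), rng.normal(size=3)
gamma, beta = rng.normal(size=3), rng.normal(size=3)               # learned BN scale and shift
mu, var, eps = rng.normal(size=3), rng.uniform(0.5, 2.0, 3), 1e-5  # frozen running statistics

def linear_then_bn(x):
    h = W.T @ x + b
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

# fold the (linear) normalization into the preceding linear layer
s = gamma / np.sqrt(var + eps)
W_folded = W * s                  # scales column j of W by s[j], i.e. row j of W^T
b_folded = s * (b - mu) + beta

x = rng.normal(size=4)
print(np.allclose(linear_then_bn(x), W_folded.T @ x + b_folded))  # True
```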
The analysis extends to convolutional neural networks of the following type.

$CNN(F_0) = K_{n-1} * \sigma(K_{n-2} * \sigma(\dots \sigma(K_0 * F_0)))$   (9)
$F_i = \sigma(K_{i-1} * \sigma(\dots \sigma(K_0 * F_0)))$

Similar to the fully connected network analysis, one can write the following, due to the element-wise scalar multiplication nature of the activation function.

$K_i * \sigma(K_{i-1} * F_{i-1}) = (K_i a_{i-1}) * (K_{i-1} * F_{i-1})$   (10)

In Eq. 10, $a_{i-1}$ is of the same spatial size as $K_i$ and consists of the slopes of the activation function in the corresponding regions of the previous feature $F_{i-1}$. Note that the above only holds for a specific spatial region, and there exists a separate $a_{i-1}$ for each spatial region that the convolution $K_{i-1}$ is applied to. For example, if $K_{i-1}$ is a $3 \times 3$ kernel, there exists a separate $a_{i-1}$ for every $3 \times 3$ region that the convolution is applied to. An effective convolution ${}_{c_{i-1}}\hat{K}_i$ can be written as follows.

${}_{c_{i-1}}\hat{K}_i = (K_i a_{i-1}) * \dots * (K_1 a_0) * K_0$
${}_{c_{i-1}}\hat{K}_i * x_0 = K_i * x_i$   (11)

Note that in Eq. 11, ${}_{c_{i-1}}\hat{K}_i$ contains specific effective convolutions per region, where a region is defined according to the receptive field of layer $i$. $c$ is defined as the concatenated categorization results of all relevant regions from previous layers.

One can observe from Eq. 11 that effective convolutions depend only on the categorizations coming from the activations, which enables the tree equivalence, similar to the analysis for fully connected networks. A difference from the fully connected case is that many decisions are made on partial input regions rather than on the entire $x_0$.
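A small numerical check of Eq. 10 in one dimension: for a leaky-ReLU nonlinearity, convolving the activated feature map with $K_i$ gives the same result as, at every output position, scaling $K_i$ by the local activation slopes $a_{i-1}$ and applying it to the pre-activation feature map. The 1-D setting, the kernel size of 3, the 0.1 slope and the random data are assumptions made for the example.

```python
import numpy as np

ALPHA = 0.1  # assumed leaky-ReLU negative slope

def leaky_relu(z):
    return np.where(z > 0, z, ALPHA * z)

def slopes(z):
    return np.where(z > 0, 1.0, ALPHA)

rng = np.random.default_rng(0)
y = rng.normal(size=12)   # stand-in for K_{i-1} * F_{i-1}: a 1-D pre-activation feature map
K = rng.normal(size=3)    # stand-in for the next kernel K_i
a = slopes(y)             # activation slopes over the feature map

# left-hand side of Eq. 10: convolve the activated feature map with K_i
# (kernel reversed so that np.convolve computes the usual cross-correlation)
lhs = np.convolve(leaky_relu(y), K[::-1], mode="valid")

# right-hand side: at every position, K_i scaled by the local slopes acts on the pre-activation map
rhs = np.array([(K * a[p:p + 3]) @ y[p:p + 3] for p in range(len(y) - 2)])

print(np.allclose(lhs, rhs))  # True: the effective kernel (K_i a_{i-1}) is specific to each region
```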
[Figure 2: decision tree equivalent of the y = x² network. Internal nodes apply categorization rules of the form ${}_c\hat{W}^T x_0 > 0$, shown as direct inequalities on $x$ (e.g. $x < -1.16$, $x < -0.1$, $x < 0.11$, $x < -0.7$, $x < -0.38$), and each of the 16 leaves applies an effective linear function ${}_c\hat{W}^T x_0$.]
3. Experimental Results

First, we make a toy experiment where we fit a neural network to the equation y = x². The neural network has 3 dense layers with 2 filters each, except for the last layer, which has 1 filter. The network uses leaky-ReLU activations after the fully connected layers, except for the last layer, which has no post-activation. Fig. 2 shows the decision tree corresponding to the neural network. In the tree, every black rectangle indicates a rule; the left child of the box means the rule does not hold, and the right child means the rule holds. For better visualization, the rules are obtained by converting $w^T x + \beta > 0$ to direct inequalities acting on $x$. This can be done for this particular regression, y = x², since $x$ is a scalar. In every leaf, the network applies a linear function (indicated by a red rectangle) based on the decisions made so far. We have avoided writing these functions explicitly due to limited space. At first glance, the tree representation of the neural network in this example seems large due to the $2^{\sum_i^{n-2} m_i} = 2^4 = 16$ categorizations. However, we notice that many of the rules in the decision tree are redundant, and hence some paths in the decision tree become invalid. An example of a redundant rule is checking $x < 0.32$ after the rule $x < -1.16$ holds. This directly creates an invalid left child for this node. Hence, the tree can be cleaned by removing the left child in this case and merging the categorization rule into the stricter one, $x < -1.16$ in this particular case. By cleaning the decision tree in Fig. 2, we obtain the simpler tree in Fig. 3a, which consists of only 5 categories instead of 16. The 5 categories are also directly visible from the model response in Fig. 3b. The interpretation of the neural network is thus straightforward: for each region whose boundaries are determined via the decision tree representation, the network approximates the non-linear equation y = x² by a linear equation. One can clearly interpret, and moreover make deductions from, the decision tree; some examples are as follows. The neural network is unable to grasp the symmetrical nature of the regression problem, which is evident from the fact that the decision boundaries are asymmetrical. The regions below −1.16 and above 1 are unbounded, and thus the neural decisions lose accuracy as x goes beyond these boundaries.
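To make the leaf functions concrete, the sketch below shows how the rule pattern and the leaf's linear function can be read off a 1-2-2-1 leaky-ReLU network of the kind described above. The weights here are random stand-ins (a model trained on y = x² would supply them), and the 0.1 negative slope is an assumption.

```python
import numpy as np

ALPHA = 0.1  # assumed leaky-ReLU negative slope

def leaky_relu(z):
    return np.where(z > 0, z, ALPHA * z)

def slopes(z):
    return np.where(z > 0, 1.0, ALPHA)

rng = np.random.default_rng(0)
# stand-in weights for a 1-2-2-1 network; training on y = x^2 would supply these
W0, b0 = rng.normal(size=(1, 2)), rng.normal(size=2)
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)
W2, b2 = rng.normal(size=(2, 1)), rng.normal(size=1)

def forward(x):
    z0 = leaky_relu(W0.T @ x + b0)
    z1 = leaky_relu(W1.T @ z0 + b1)
    return W2.T @ z1 + b2

def leaf_affine(x):
    """Follow the decision path of x and return the effective (w, b) of the leaf it lands in."""
    h0 = W0.T @ x + b0
    a0 = slopes(h0)                 # first categorization (2 rule outcomes)
    w = (W1.T * a0) @ W0.T          # W_1^T diag(a_0) W_0^T
    b = (W1.T * a0) @ b0 + b1
    a1 = slopes(w @ x + b)          # second categorization (2 more rule outcomes)
    w = (W2.T * a1) @ w
    b = (W2.T * a1) @ b + b2
    return w, b                     # the leaf applies y = w x + b

x = np.array([0.7])
w, b = leaf_affine(x)
print(np.allclose(forward(x), w @ x + b))  # True: each leaf is a linear model in x
```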
Next, we investigate another toy problem of classifying half-moons and analyse the decision tree produced by a neural network. We train a fully connected neural network with 3 layers with leaky-ReLU activations, except for the last layer, which has a sigmoid activation. Each layer has 2 filters except for the last layer, which has 1. The cleaned decision tree induced by the trained network is shown in Fig. 4. The decision tree finds many categories whose boundaries are
[Figure 4: cleaned decision tree for the half-moon classifier; the leaves apply linear functions of (x, y) (e.g. 0.45x − 0.3y + 0.05, 1.6x − 1.35y − 1.44, 5.31x − 4.51y + 3.14) followed by a class decision of 0 or 1.]
              y = x²                          Half-Moon
      Param.   Comp.   Mult./Add.     Param.   Comp.   Mult./Add.
Tree  14       2.6     2              39       4.1     8.2
NN    13       4       16             15       5       25
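As a rough check of the NN column for y = x², assuming the 1-2-2-1 architecture with biases described above and reading Comp. as the number of activation comparisons (our reading, not stated in the table):

Param.: $(1\cdot 2 + 2) + (2\cdot 2 + 2) + (2\cdot 1 + 1) = 4 + 6 + 3 = 13$
Mult./Add.: $(2 + 2) + (4 + 4) + (2 + 2) = 16$
Comp.: $2 + 2 = 4$ leaky-ReLU sign tests.

A consistent reading of the Tree column is then 4 rule thresholds plus 5 leaves with a slope and an intercept each (14 parameters), an average path of 2.6 comparisons, and one multiplication plus one addition at the leaf.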