
Decision Tree

Hemant Thapa

Importing Libraries
In [1]: import math
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Decision Tree Algorithm


A Decision Tree stands out as a widely used and robust machine learning technique that I've had the opportunity to study. It falls under the
category of supervised learning methods that do not rely on specific assumptions about the data and can be applied to both classification and
regression tasks. The core idea is to construct a model that predicts the outcome of a target variable by learning straightforward decision
rules from the characteristics of the data. In classification, the target values are distinct categories, while in regression, they are continuous
numerical values.

Decision Trees consist of a tree-like structure where each internal node represents a decision rule based on one of the data features, and each
leaf node corresponds to the predicted outcome or class. The process of constructing a Decision Tree involves selecting the best features to
split the data and optimizing the decision rules at each node to maximize predictive accuracy.

1. Root Node:
The root node is the topmost node in a decision tree.
It represents the entire dataset or a subset of the data at the beginning of the tree-building process.
The root node serves as the starting point for the tree's construction.
At the root node, a decision is made to split the data into subsets based on the values of a specific feature (attribute). This feature is
chosen because it results in the best separation of data according to a certain criterion (e.g., Gini impurity, information gain, variance
reduction).
It's like the most important question we ask when trying to make decisions. The root node helps us decide how to divide our data into
smaller groups (subsets).


Question 1 - Which feature (attribute) should we use to split our data?

Question 2 - What's the best way to divide our data into more manageable groups?

Answer 1 - We choose the feature (attribute) that provides the most useful information for making decisions.

Answer 2 - This feature is selected based on the highest Information Gain or Gini Gain, which helps us make the best possible split.

2. Decision Nodes:
Decision nodes are the internal nodes of the decision tree, situated between the root node and the terminal nodes.
Each decision node represents a specific decision or condition based on a feature's value.
These nodes serve as points where the dataset is split into subsets according to the decision or condition.
The decision node's role is to determine which branch of the tree to follow based on whether the condition is true or false for a particular
data point.
Decision nodes are like the middle managers in our decision-making process. They help us decide whether we need to split our data into
more detailed groups or not. Decision nodes act as traffic controllers for data, guiding it down different paths.

3. Terminal Nodes (Leaf Nodes):


Terminal nodes, often referred to as leaf nodes, are the endpoints of the decision tree.
They do not contain any further decisions or splits.
Terminal nodes represent the final predicted outcome or class for a specific subset of data that has reached that point in the tree.
In classification tasks, each leaf node corresponds to a class label, indicating the predicted class for data points that fall into that node's
subset. In regression tasks, leaf nodes contain numerical values representing predictions.
Leaf nodes can be treated as the final outcomes or decisions in our decision tree. They provide us with the answer or prediction.

4. Gini Impurity, Information Gain and Variance Reduction (for Regression)

Gini Impurity:
Gini impurity is a measure of the disorder or impurity within a dataset.
In the context of a decision tree, it quantifies the probability of incorrectly classifying a randomly chosen element from the dataset.
A Gini impurity of 0 indicates that the dataset is perfectly pure, meaning all elements belong to the same class. Higher values indicate a more
impure dataset whose elements are spread evenly among the classes (the maximum is 0.5 for two classes and approaches 1 as the number of classes grows).
When deciding how to split a dataset at a decision node, the algorithm calculates the Gini impurity for each possible split and selects the
split that minimises the impurity, resulting in a more homogeneous subset.
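
As a small illustration (not part of the original notebook), here is a minimal Python sketch of how Gini impurity could be computed for a list of class labels; the helper name gini_impurity is simply an illustrative choice:

from collections import Counter

def gini_impurity(labels):
    # Gini = 1 - sum of squared class proportions
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

print(gini_impurity(["Yes", "Yes", "Yes"]))       # 0.0 (perfectly pure)
print(gini_impurity(["Yes", "No", "Yes", "No"]))  # 0.5 (maximally impure for two classes)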

Information Gain:
Information gain measures the reduction in uncertainty or entropy achieved by a particular split in the data.
Entropy is a concept from information theory that quantifies the disorder or randomness in a dataset. High entropy indicates high
disorder, while low entropy indicates order or purity.
Information gain calculates the difference in entropy before and after a split. A high information gain means that the split results in a
more orderly separation of data.
Decision tree algorithms aim to maximise information gain when choosing which feature to split on. They select the feature that leads to
the greatest reduction in uncertainty or entropy within the child nodes.
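
To make the idea concrete, here is a minimal sketch (illustrative, not an original notebook cell) of entropy and information gain for a candidate split; entropy_of and information_gain are hypothetical helper names:

import math
from collections import Counter

def entropy_of(labels):
    # H = -sum(p_k * log2(p_k)) over the classes present in labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, child_groups):
    # entropy before the split minus the weighted entropy of the child subsets
    total = len(parent_labels)
    weighted = sum((len(g) / total) * entropy_of(g) for g in child_groups)
    return entropy_of(parent_labels) - weighted

labels = ["Yes", "Yes", "Yes", "No", "No"]
print(information_gain(labels, [["Yes", "Yes", "Yes"], ["No", "No"]]))  # about 0.971: this split removes all uncertainty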

Variance Reduction (for Regression):


While Gini impurity and information gain are primarily used for classification tasks, variance reduction is specifically used in regression
tasks.
Variance reduction measures the reduction in variance of the target variable within a dataset.
The goal in regression decision trees is to minimise the variance of the target variable within each child node. A lower variance indicates
that the predicted values within the node are more consistent and accurate.
When deciding how to split a dataset in a regression tree, the algorithm calculates the variance of the target variable for each possible
split and selects the split that results in the greatest reduction in variance.
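
A minimal sketch (not part of the original notebook) of how the variance reduction of one candidate split could be measured; variance_reduction is a hypothetical helper name and the target values are invented:

import numpy as np

def variance_reduction(parent, left, right):
    # variance of the target before the split minus the weighted variance of the two children
    parent, left, right = np.asarray(parent), np.asarray(left), np.asarray(right)
    n = len(parent)
    weighted_child_var = (len(left) / n) * left.var() + (len(right) / n) * right.var()
    return parent.var() - weighted_child_var

# A split that separates low and high target values gives a large reduction.
y_parent = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]
print(variance_reduction(y_parent, [1.0, 1.2, 0.9], [5.1, 4.8, 5.3]))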

5. Advantages:
Interpretability: Decision Trees offer a clear and interpretable model, making it easy to understand why a particular decision was made.
This transparency is crucial for applications where understanding the reasoning behind predictions is essential, such as in healthcare and
finance.

Versatility: Decision Trees can handle both categorical and numerical data, as well as mixed data types. They are robust against outliers
and missing values, making them suitable for real-world datasets.

Feature Importance: Decision Trees provide a measure of feature importance, which helps identify which features have the most
significant impact on predictions. This information can guide feature selection and engineering efforts.

file:///C:/Users/heman/Downloads/Understanding Decision Trees.html 2/10


29/12/2023, 00:11 Understanding Decision Trees

6. Challenges:
Overfitting: Decision Trees are prone to overfitting, where the model captures noise in the training data, leading to poor generalisation on
unseen data. Techniques like pruning and setting a maximum depth are used to mitigate overfitting.
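
A hedged sketch of how these controls look in scikit-learn (the parameter values below are arbitrary examples, not tuned recommendations):

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: cap the depth and require a minimum number of samples per leaf.
shallow_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# Post-pruning: cost-complexity pruning; a larger ccp_alpha gives a smaller tree.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01)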

Instability: Small changes in the data can result in different tree structures, making Decision Trees somewhat unstable. Ensemble methods
like Random Forest and Gradient Boosting address this issue by combining multiple trees to improve stability and predictive performance.
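
A brief sketch of the ensemble alternatives mentioned above (hyperparameters are placeholders):

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=0)        # many trees on bootstrap samples, predictions averaged
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)  # trees fitted sequentially to the remaining errors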

7. Popular types of decision tree algorithms

ID3 (Iterative Dichotomiser 3):

ID3 is one of the earliest decision tree algorithms developed by Ross Quinlan. It is primarily used for classification tasks.
ID3 uses the entropy and information gain concepts to determine the best features for splitting the data at each node.
It builds a tree by recursively selecting attributes that result in the most significant information gain.

CART (Classification And Regression Trees):

CART is a versatile decision tree algorithm that can be used for both classification and regression problems.
Instead of entropy, CART uses Gini impurity to evaluate the quality of splits in classification tasks. For regression, it assesses the mean
squared error reduction.
CART constructs binary trees, meaning that each internal node has exactly two child nodes.
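
For the regression side, a CART-style tree in scikit-learn splits on squared-error (variance) reduction and always produces binary splits. A small sketch with made-up toy data (the criterion name "squared_error" applies to recent scikit-learn versions):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_toy = np.array([[1], [2], [3], [10], [11], [12]])
y_toy = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.9])

reg = DecisionTreeRegressor(criterion="squared_error", max_depth=2).fit(X_toy, y_toy)
print(reg.predict([[2.5], [11.5]]))  # predictions near 1 and near 5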

Chi-Square (χ²) Automatic Interaction Detection (CHAID):

CHAID is a decision tree algorithm primarily used for categorical target variables and categorical predictors.
It employs statistical tests like the Chi-Square test to determine the best splits based on the significance of relationships between
variables.
CHAID is particularly useful for exploring interactions between categorical variables.
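
scikit-learn does not implement CHAID, but the chi-square test it relies on can be sketched with scipy; the toy table below is invented purely for illustration:

import pandas as pd
from scipy.stats import chi2_contingency

toy = pd.DataFrame({
    "Destination": ["Red", "Red", "Blue", "Blue", "Red", "Blue"],
    "Enjoyed":     ["Yes", "No",  "Yes",  "Yes",  "No",  "Yes"],
})
contingency = pd.crosstab(toy["Destination"], toy["Enjoyed"])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(p_value)  # a small p-value would indicate a statistically meaningful split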

Reduction in Variance (for Regression):

This method is specific to regression problems and is used by regression-tree algorithms such as CART.
Instead of impurity measures, it focuses on reducing the variance of the target variable within each split.
In each node, it aims to find a split that results in the smallest total variance within the child nodes, leading to a more accurate regression
model.

8. ID3 (Iterative Dichotomiser)

8.1 Homogeneity

Homogeneity is the quality or state of being uniform, consistent, or similar in nature. In various contexts, homogeneity implies that the
elements or components within a group or system are alike or exhibit a high degree of similarity with respect to a particular characteristic
or property.

Homogeneity of variance refers to the assumption that the variances (the spread or dispersion of data) are approximately equal across
different groups or samples being compared. Violations of this assumption can impact the validity of statistical tests.

We can calculate the variance or standard deviation of the data points within each group. If the variances are similar or within an
acceptable range, this suggests homogeneity. Graphical methods such as histograms or box plots can also be used to visualise the
distribution of data within each group.
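
A quick numerical check of this assumption might look like the sketch below (the two groups are invented sample data):

import numpy as np

group_a = np.array([4.9, 5.1, 5.0, 4.8, 5.2])
group_b = np.array([5.0, 5.3, 4.7, 5.1, 4.9])

print("Variance A:", group_a.var(ddof=1))  # sample variance of group A
print("Variance B:", group_b.var(ddof=1))  # sample variance of group B
# Similar values suggest homogeneity of variance; box plots of both groups show the same visually.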

8.2 What is “Entropy”, and what is its function?

Entropy is used as a measure of impurity or disorder within a dataset. It helps decide how to split data at decision nodes to create more
homogeneous subsets.
The function of entropy is to guide the construction of decision trees by identifying features that result in the greatest reduction of
uncertainty (entropy) in the child nodes after a split. It helps select the most informative features for classification or regression tasks.

Entropy(S)

The more uncertain or unpredictable the outcomes, the higher the entropy.

$$H(S) = -\sum_{i} p_i \log_2(p_i)$$

where the sum runs over the classes and $p_i$ is the proportion of elements belonging to class $i$.


$$\text{Entropy}(S) = -\frac{p}{p+n}\log_2\!\left(\frac{p}{p+n}\right) - \frac{n}{p+n}\log_2\!\left(\frac{n}{p+n}\right)$$

Entropy(S) represents the entropy of the dataset S.


p is the number of positive instances (e.g., instances belonging to a particular class) in the dataset S.
n is the number of negative instances (e.g., instances not belonging to that class) in the dataset S.
$\log_2(x)$ represents the logarithm of x with base 2.

Dataset S :

Contains elements of two classes, positive (p) and negative (n).

First Term :

$-\frac{p}{p+n}\log_2\!\left(\frac{p}{p+n}\right)$

$\frac{p}{p+n}$ is the probability of an element being in the positive class.
$\log_2\!\left(\frac{p}{p+n}\right)$ calculates the information (in bits) for this class.
The whole term represents the contribution of the positive class to the entropy of the dataset.

Second Term :

$-\frac{n}{p+n}\log_2\!\left(\frac{n}{p+n}\right)$

$\frac{n}{p+n}$ is the probability of an element being in the negative class.
$\log_2\!\left(\frac{n}{p+n}\right)$ calculates the information for this class.
This term represents the contribution of the negative class to the entropy.

Overall Entropy:
The sum of these two terms gives the total entropy of the dataset, measuring its impurity or disorder.

Example :

30 positive cases (p = 30)


10 negative cases (n = 10)

In [2]: #positive class and negative class


positive = 30
negative = 10

In [3]: #probabilities
p_prob = positive / (positive + negative)
n_prob = negative / (positive + negative)

In [4]: #entropy
entropy = (-p_prob * math.log2(p_prob)) - (n_prob * math.log2(n_prob))
print("Entropy:", entropy)

Entropy: 0.8112781244591328

$$\text{Entropy}(S) = -\frac{p}{p+n}\log_2\!\left(\frac{p}{p+n}\right) - \frac{n}{p+n}\log_2\!\left(\frac{n}{p+n}\right)$$

1. Calculate Probabilities:

Probability of positive case:

$\frac{p}{p+n} = \frac{30}{30+10} = \frac{30}{40} = 0.75$

Probability of negative case:

$\frac{n}{p+n} = \frac{10}{30+10} = \frac{10}{40} = 0.25$

2. Apply Entropy Formula:

$\text{Entropy}(S) = -\frac{30}{40}\log_2\!\left(\frac{30}{40}\right) - \frac{10}{40}\log_2\!\left(\frac{10}{40}\right)$

Simplify: $\text{Entropy}(S) = -0.75\log_2(0.75) - 0.25\log_2(0.25)$

3. Calculate Logarithmic Values:

$\log_2(0.75) \approx -0.415$

$\log_2(0.25) = -2$

4. Substitute and Calculate:


$\text{Entropy}(S) = (-0.75 \times -0.415) - (0.25 \times -2)$

Simplify: $\text{Entropy}(S) \approx 0.31125 + 0.5$

$\text{Entropy}(S) \approx 0.81125$

Example 2

Our dataset consists of three attributes: A1 represents the type of destination (labelled here with the colour codes Red and Blue), A2 represents
the average temperature of the destination (Hot or Cold), and Class indicates whether the customer enjoyed their vacation, with values Yes
(enjoyed) or No (did not enjoy).

The data frame therefore has three columns, A1, A2, and Class, each holding categorical values.

A1: categorical labels with possible values Red and Blue, standing for the two types of vacation destination.
A2: temperature categories with possible values Hot and Cold, describing the climate of the vacation destination.
Class: the target variable, a binary outcome with possible values Yes and No, indicating whether the customer enjoyed their
vacation experience.

In [5]: dataset = {
"A1": ["Red", "Red", "Blue", "Blue", "Red"],
"A2": ["Hot", "Cold", "Hot", "Cold", "Hot"],
"Class": ["Yes", "No", "Yes", "Yes", "No"]
}

In [6]: df = pd.DataFrame(dataset)

In [7]: #entropy of a binary set


def entropy_binary(p, n):
    #handle the case where there is no impurity
    if p == 0 or n == 0:
        return 0
    total = p + n
    return -(p/total) * math.log2(p/total) - (n/total) * math.log2(n/total)

In [8]: def entropy(label, value):
    #helper that prints a labelled entropy value to four decimal places
    print(f'Entropy (S) of {label} : {value:.4f}')

In [9]: print(df)

A1 A2 Class
0 Red Hot Yes
1 Red Cold No
2 Blue Hot Yes
3 Blue Cold Yes
4 Red Hot No

Class Entropy

We count how many times each outcome appears in the Class category of our dataset.
We find that Yes appears 3 times and No appears 2 times.
We then add these counts together to get the total number of entries in the Class category, which in this case is 5 (3 Yes and 2 No).
The concept of entropy is a way to measure how mixed or uncertain the Class category is.
High entropy means that the outcomes are very mixed (like having an equal number of Yes and No), indicating more unpredictability.
Low entropy means that the outcomes are not very mixed (like having mostly Yes and very few No), indicating less unpredictability.

In [10]: #class counts


total_yes = 3
total_no = 2
total_class_count = total_yes + total_no

In [11]: #entropy of the dataset


entropy_s = entropy_binary(total_yes, total_no)
entropy("Class", entropy_s)

Entropy (S) of Class : 0.9710

An entropy value of 0.9710 is relatively high, suggesting a significant level of mixture or diversity in the outcomes. It implies that the Class
category contains a fairly balanced mix of Yes and No outcomes, but not perfectly balanced. If it were perfectly balanced, the entropy would
be exactly 1.

A1 Red Entropy

In [12]: #entropy for A1


#Red: 1 Yes, 2 No
a1_red_yes = 1
a1_red_no = 2


entropy_red = entropy_binary(a1_red_yes, a1_red_no)


entropy("A1 RED", entropy_red)

Entropy (S) of A1 RED : 0.9183

We count how many times Yes and No occur when A1 is Red: one Yes and two No.


An entropy of 0.9183 is quite high on a scale from 0 to 1, where 0 represents no uncertainty (all outcomes are the same) and 1 represents
maximum uncertainty (a perfect split between outcomes).

A1 Blue Entropy

In [13]: #Blue: 2 Yes, 0 No


a1_blue_yes = 2
a1_blue_no = 0
entropy_blue = entropy_binary(a1_blue_yes, a1_blue_no)
entropy("A1 BLUE", entropy_blue)

Entropy (S) of A1 BLUE : 0.0000

Since every Blue instance has the outcome Yes (2 Yes, 0 No), the subset is perfectly pure, so the entropy is at its minimum.
An entropy of 0.0 for A1 BLUE indicates that the outcome is completely predictable whenever A1 is Blue.

Average Information Entropy for A1

$$I(\text{Attribute}) = \sum_i \frac{p_i + n_i}{p + n}\,\text{Entropy}(A_i)$$

In [14]: #average information entropy for A1


#weight each branch's entropy by the fraction of rows that fall in it
red_count = a1_red_yes + a1_red_no
blue_count = a1_blue_yes + a1_blue_no
i_a1 = (red_count/total_class_count) * entropy_red + (blue_count/total_class_count) * entropy_blue

print(i_a1)

0.5509775004326937

This weighted average tells us how much uncertainty about Class remains after splitting on A1.
A value of roughly 0.55, well below the dataset entropy of 0.9710, shows that splitting on A1 removes a substantial part of that
uncertainty.

A2 Hot Entropy

In [15]: #entropy for A2


# A2 = Hot: 2 Yes, 1 No
hot_yes = 2
hot_no = 1

entropy_hot = entropy_binary(hot_yes, hot_no)


entropy("A2 HOT", entropy_hot)

Entropy (S) of A2 HOT : 0.9183

A2 Cold Entropy

In [16]: # A2 = Cold: 1 Yes, 1 No


cold_yes = 1
cold_no = 1
entropy_cold = entropy_binary(cold_yes , cold_no)
entropy("A2 COLD", entropy_cold)

Entropy (S) of A2 COLD : 1.0000

The entropy for the category Hot in A2 is 0.9183, which is relatively high, while the entropy for the category Cold in A2 is 1.0, the maximum
possible for a binary outcome.

The high entropy for Hot means that when A2 is Hot there is still considerable uncertainty about whether the class will be Yes or
No: Hot is associated with Yes twice and No once, so it does not strongly lean towards a single outcome.

The maximum entropy for Cold means that knowing A2 is Cold tells us nothing at all about the class: Cold appears once with Yes and once
with No, a perfect split.

Average Information Entropy for A2


$$I(\text{Attribute}) = \sum_i \frac{p_i + n_i}{p + n}\,\text{Entropy}(A_i)$$

In [17]: #average information entropy for A2


hot_count = hot_yes + hot_no
cold_count = cold_yes + cold_no
i_a2 = (hot_count/total_class_count) * entropy_hot + (cold_count/total_class_count) * entropy_cold

print(i_a2)

0.9509775004326937

Information Gain for each Attribute

Difference in Entropy before and after splitting dataset on attribute A

Gain = Entropy(S) − I (Attribute)

In [18]: #information gain for each attribute


gain_a1 = entropy_s - i_a1
gain_a2 = entropy_s - i_a2

In [19]: entropy_s, gain_a1, gain_a2

Out[19]: (0.9709505944546686, 0.4199730940219749, 0.01997309402197489)

gain_a1 is 0.4199730940219749, indicating how much considering the destination type (A1) reduces the confusion about customer satisfaction.
gain_a2 is 0.01997309402197489, indicating how much considering the temperature (A2) reduces the confusion about customer satisfaction.
Because A1 yields the larger gain, ID3 would split on A1 first, which agrees with the pandas computation and the scikit-learn feature importances shown later in this notebook.
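
As a small follow-up (not an original cell), the gains computed above could be used directly to pick the root attribute, mirroring what ID3 does:

gains = {"A1": gain_a1, "A2": gain_a2}
best_attribute = max(gains, key=gains.get)
print("Best attribute to split on:", best_attribute)  # A1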

Decision Tree Library


In [20]: #loading machine learning library
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split

In [21]: #creating model


clf = DecisionTreeClassifier()

In [22]: #printing data frame


df

Out[22]: A1 A2 Class

0 Red Hot Yes

1 Red Cold No

2 Blue Hot Yes

3 Blue Cold Yes

4 Red Hot No

In [23]: #encode categorical variables


df_encoded = pd.get_dummies(df, columns=["A1", "A2"])

In [24]: df_encoded

Out[24]: Class A1_Blue A1_Red A2_Cold A2_Hot

0 Yes False True False True

1 No False True True False

2 Yes True False False True

3 Yes True False True False

4 No False True False True

Feature Selection

In [25]: X = df_encoded.drop('Class', axis=1)


X


Out[25]: A1_Blue A1_Red A2_Cold A2_Hot

0 False True False True

1 False True True False

2 True False False True

3 True False True False

4 False True False True

In [26]: y = df_encoded['Class'].values
y

Out[26]: array(['Yes', 'No', 'Yes', 'Yes', 'No'], dtype=object)

Splitting into Train and Test Set

In [27]: #80 percent for training and 20 percent for test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [28]: X_train

Out[28]: A1_Blue A1_Red A2_Cold A2_Hot

0 False True False True

1 False True True False

3 True False True False

4 False True False True

In [29]: X_test

Out[29]: A1_Blue A1_Red A2_Cold A2_Hot

2 True False False True

Training Model

In [30]: clf.fit(X_train, y_train)

Out[30]: DecisionTreeClassifier()

Accuracy of Model

In [31]: print("Training Accuracy:", clf.score(X_train, y_train))


print("Testing Accuracy:", clf.score(X_test, y_test))

Training Accuracy: 0.75


Testing Accuracy: 1.0

In [32]: print("Classes:", clf.classes_)


print("Feature Importances:", clf.feature_importances_)
print("Feature Names:", clf.feature_names_in_)
print("Number of Classes:", clf.n_classes_)
print("Number of Outputs:", clf.n_outputs_)
print("Number of Leaves:", clf.get_n_leaves())

Classes: ['No' 'Yes']


Feature Importances: [0.66666667 0. 0. 0.33333333]
Feature Names: ['A1_Blue' 'A1_Red' 'A2_Cold' 'A2_Hot']
Number of Classes: 2
Number of Outputs: 1
Number of Leaves: 3

In [33]: n_nodes = clf.tree_.node_count


children_left = clf.tree_.children_left
children_right = clf.tree_.children_right
impurity = clf.tree_.impurity

print("Node count:", n_nodes)

for i in range(n_nodes):
    if children_left[i] != children_right[i]:
        print(f"Node {i} gini impurity: {impurity[i]}")


Node count: 5
Node 0 gini impurity: 0.5
Node 1 gini impurity: 0.4444444444444444

Decision Tree Diagram

In [34]: #plotting decision tree diagram


plt.figure(figsize=(6, 6))
plot_tree(clf, filled=True, feature_names=X.columns, class_names=["not enjoyed", "enjoyed"], rounded=True, fontsize=8)
plt.show()
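
A plain-text view of the same fitted tree can also be printed; this is an optional extra, not an original cell:

from sklearn.tree import export_text
print(export_text(clf, feature_names=list(X.columns)))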

In [35]: df

Out[35]: A1 A2 Class

0 Red Hot Yes

1 Red Cold No

2 Blue Hot Yes

3 Blue Cold Yes

4 Red Hot No

In [36]: def entropy(y):
    if len(y) == 0:
        return 0
    p = sum(y == "Yes") / len(y)
    n = sum(y == "No") / len(y)
    if p == 0 or n == 0:
        return 0
    return -(p * math.log2(p) + n * math.log2(n))

In [37]: #entropy of the dataset


overall_entropy = entropy(df['Class'])

In [38]: print("Overall Entropy of dataset:", overall_entropy)

Overall Entropy of dataset: 0.9709505944546686

In [39]: #entropy for each attribute


entropies = {}
for i in ["A1", "A2"]:
    unique_values = df[i].unique()
    entropy_sum = 0
    for value in unique_values:
        subset = df[df[i] == value]['Class']
        entropy_sum += (len(subset) / len(df)) * entropy(subset)
    entropies[i] = entropy_sum

In [40]: print(entropies)

{'A1': 0.5509775004326937, 'A2': 0.9509775004326937}

In [41]: print("Entropy for each attribute A1:", entropies['A1'])

Entropy for each attribute A1: 0.5509775004326937


In [42]: print("Entropy for each attribute A2:", entropies['A2'])

Entropy for each attribute A2: 0.9509775004326937


