
Decision Tree

Hemant Thapa

Importing Libraries
In [1]: import math
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Decision Tree Algorithm


A Decision Tree stands out as a widely used and robust machine learning technique that I've had the opportunity to study. It falls under the
category of supervised learning methods that do not rely on specific assumptions about the data and can be applied to both classification and
regression tasks. The core idea is to construct a model that predicts the outcome of a target variable by learning straightforward decision
rules from the characteristics of the data. In classification, the target values are distinct categories, while in regression, they are continuous
numerical values.

Decision Trees consist of a tree-like structure where each internal node represents a decision rule based on one of the data features, and each
leaf node corresponds to the predicted outcome or class. The process of constructing a Decision Tree involves selecting the best features to
split the data and optimizing the decision rules at each node to maximize predictive accuracy.

1. Root Node:
The root node is the topmost node in a decision tree.
It represents the entire dataset or a subset of the data at the beginning of the tree-building process.
The root node serves as the starting point for the tree's construction.
At the root node, a decision is made to split the data into subsets based on the values of a specific feature (attribute). This feature is
chosen because it results in the best separation of data according to a certain criterion (e.g., Gini impurity, information gain, variance
reduction).
It's like the most important question we ask when trying to make decisions. The root node helps us decide how to divide our data into
smaller groups (subsets).


Question 1 - Which feature (attribute) should we use to split our data?

Question 2 - What's the best way to divide our data into more manageable groups?

Answer 1 - We choose the feature (attribute) that provides the most useful information for making decisions.

Answer 2 - This feature is selected based on the highest Information Gain or Gini Gain, which helps us make the best possible split.

2. Decision Nodes:
Decision nodes are the internal nodes of the decision tree, situated between the root node and the terminal nodes.
Each decision node represents a specific decision or condition based on a feature's value.
These nodes serve as points where the dataset is split into subsets according to the decision or condition.
The decision node's role is to determine which branch of the tree to follow based on whether the condition is true or false for a particular
data point.
Decision nodes are like the middle managers in our decision-making process. They help us decide whether we need to split our data into
more detailed groups or not. Decision nodes act as traffic controllers for data, guiding it down different paths.

3. Terminal Nodes (Leaf Nodes):


Terminal nodes, often referred to as leaf nodes, are the endpoints of the decision tree.
They do not contain any further decisions or splits.
Terminal nodes represent the final predicted outcome or class for a specific subset of data that has reached that point in the tree.
In classification tasks, each leaf node corresponds to a class label, indicating the predicted class for data points that fall into that node's
subset. In regression tasks, leaf nodes contain numerical values representing predictions.
Leaf nodes can be treated as the final outcomes or decisions in our decision tree. They provide us with the answer or prediction.

4. Gini Impurity, Information Gain and Variance Reduction (for Regression)

Gini Impurity:
Gini impurity is a measure of the disorder or impurity within a dataset.
In the context of a decision tree, it quantifies the probability of incorrectly classifying a randomly chosen element from the dataset.
A Gini impurity of 0 indicates that the dataset is perfectly pure, meaning all elements belong to the same class. Higher values indicate a more
impure dataset whose elements are spread evenly among the classes (the maximum is 0.5 for two classes and approaches 1 as the number of classes grows).
When deciding how to split a dataset at a decision node, the algorithm calculates the Gini impurity for each possible split and selects the
split that minimises the impurity, resulting in a more homogeneous subset.
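
As a small illustration (not part of the original notebook), here is a minimal Python sketch of how Gini impurity could be computed for a list of class labels; the helper name gini_impurity is simply an illustrative choice:

from collections import Counter

def gini_impurity(labels):
    # Gini = 1 - sum of squared class proportions
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

print(gini_impurity(["Yes", "Yes", "Yes"]))       # 0.0 (perfectly pure)
print(gini_impurity(["Yes", "No", "Yes", "No"]))  # 0.5 (maximally impure for two classes)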

Information Gain:
Information gain measures the reduction in uncertainty or entropy achieved by a particular split in the data.
Entropy is a concept from information theory that quantifies the disorder or randomness in a dataset. High entropy indicates high
disorder, while low entropy indicates order or purity.
Information gain calculates the difference in entropy before and after a split. A high information gain means that the split results in a
more orderly separation of data.
Decision tree algorithms aim to maximise information gain when choosing which feature to split on. They select the feature that leads to
the greatest reduction in uncertainty or entropy within the child nodes.
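
To make the idea concrete, here is a minimal sketch (illustrative, not an original notebook cell) of entropy and information gain for a candidate split; entropy_of and information_gain are hypothetical helper names:

import math
from collections import Counter

def entropy_of(labels):
    # H = -sum(p_k * log2(p_k)) over the classes present in labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, child_groups):
    # entropy before the split minus the weighted entropy of the child subsets
    total = len(parent_labels)
    weighted = sum((len(g) / total) * entropy_of(g) for g in child_groups)
    return entropy_of(parent_labels) - weighted

labels = ["Yes", "Yes", "Yes", "No", "No"]
print(information_gain(labels, [["Yes", "Yes", "Yes"], ["No", "No"]]))  # about 0.971: this split removes all uncertainty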

Variance Reduction (for Regression):


While Gini impurity and information gain are primarily used for classification tasks, variance reduction is specifically used in regression
tasks.
Variance reduction measures the reduction in variance of the target variable within a dataset.
The goal in regression decision trees is to minimise the variance of the target variable within each child node. A lower variance indicates
that the predicted values within the node are more consistent and accurate.
When deciding how to split a dataset in a regression tree, the algorithm calculates the variance of the target variable for each possible
split and selects the split that results in the greatest reduction in variance.
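
A minimal sketch (not part of the original notebook) of how the variance reduction of one candidate split could be measured; variance_reduction is a hypothetical helper name and the target values are invented:

import numpy as np

def variance_reduction(parent, left, right):
    # variance of the target before the split minus the weighted variance of the two children
    parent, left, right = np.asarray(parent), np.asarray(left), np.asarray(right)
    n = len(parent)
    weighted_child_var = (len(left) / n) * left.var() + (len(right) / n) * right.var()
    return parent.var() - weighted_child_var

# A split that separates low and high target values gives a large reduction.
y_parent = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]
print(variance_reduction(y_parent, [1.0, 1.2, 0.9], [5.1, 4.8, 5.3]))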

5. Advantages:
Interpretability: Decision Trees offer a clear and interpretable model, making it easy to understand why a particular decision was made.
This transparency is crucial for applications where understanding the reasoning behind predictions is essential, such as in healthcare and
finance.

Versatility: Decision Trees can handle both categorical and numerical data, as well as mixed data types. They are robust against outliers
and missing values, making them suitable for real-world datasets.

Feature Importance: Decision Trees provide a measure of feature importance, which helps identify which features have the most
significant impact on predictions. This information can guide feature selection and engineering efforts.

file:///C:/Users/heman/Downloads/Understanding Decision Trees.html 2/10


29/12/2023, 00:11 Understanding Decision Trees

6. Challenges:
Overfitting: Decision Trees are prone to overfitting, where the model captures noise in the training data, leading to poor generalisation on
unseen data. Techniques like pruning and setting a maximum depth are used to mitigate overfitting.
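
A hedged sketch of how these controls look in scikit-learn (the parameter values below are arbitrary examples, not tuned recommendations):

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: cap the depth and require a minimum number of samples per leaf.
shallow_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# Post-pruning: cost-complexity pruning; a larger ccp_alpha gives a smaller tree.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01)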

Instability: Small changes in the data can result in different tree structures, making Decision Trees somewhat unstable. Ensemble methods
like Random Forest and Gradient Boosting address this issue by combining multiple trees to improve stability and predictive performance.
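
A brief sketch of the ensemble alternatives mentioned above (hyperparameters are placeholders):

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=0)        # many trees on bootstrap samples, predictions averaged
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)  # trees fitted sequentially to the remaining errors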

7. Popular types of decision tree algorithms

ID3 (Iterative Dichotomiser 3):

ID3 is one of the earliest decision tree algorithms developed by Ross Quinlan. It is primarily used for classification tasks.
ID3 uses the entropy and information gain concepts to determine the best features for splitting the data at each node.
It builds a tree by recursively selecting attributes that result in the most significant information gain.

CART (Classification And Regression Trees):

CART is a versatile decision tree algorithm that can be used for both classification and regression problems.
Instead of entropy, CART uses Gini impurity to evaluate the quality of splits in classification tasks. For regression, it assesses the mean
squared error reduction.
CART constructs binary trees, meaning that each internal node has exactly two child nodes.
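
For the regression side, a CART-style tree in scikit-learn splits on squared-error (variance) reduction and always produces binary splits. A small sketch with made-up toy data (the criterion name "squared_error" applies to recent scikit-learn versions):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_toy = np.array([[1], [2], [3], [10], [11], [12]])
y_toy = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.9])

reg = DecisionTreeRegressor(criterion="squared_error", max_depth=2).fit(X_toy, y_toy)
print(reg.predict([[2.5], [11.5]]))  # predictions near 1 and near 5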

Chi-Square (χ²) Automatic Interaction Detection (CHAID):

CHAID is a decision tree algorithm primarily used for categorical target variables and categorical predictors.
It employs statistical tests like the Chi-Square test to determine the best splits based on the significance of relationships between
variables.
CHAID is particularly useful for exploring interactions between categorical variables.
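
scikit-learn does not implement CHAID, but the chi-square test it relies on can be sketched with scipy; the toy table below is invented purely for illustration:

import pandas as pd
from scipy.stats import chi2_contingency

toy = pd.DataFrame({
    "Destination": ["Red", "Red", "Blue", "Blue", "Red", "Blue"],
    "Enjoyed":     ["Yes", "No",  "Yes",  "Yes",  "No",  "Yes"],
})
contingency = pd.crosstab(toy["Destination"], toy["Enjoyed"])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(p_value)  # a small p-value would indicate a statistically meaningful split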

Reduction in Variance (for Regression):

This method is specific to regression problems and is used by regression-tree algorithms such as CART.
Instead of impurity measures, it focuses on reducing the variance of the target variable within each split.
In each node, it aims to find a split that results in the smallest total variance within the child nodes, leading to a more accurate regression
model.

8. ID3 (Iterative Dichotomiser)

8.1 Homogeneity

Homogeneity is the quality or state of being uniform, consistent, or similar in nature. In various contexts, homogeneity implies that the
elements or components within a group or system are alike or exhibit a high degree of similarity with respect to a particular characteristic
or property.

Homogeneity of variance refers to the assumption that the variances (the spread or dispersion of data) are approximately equal across
different groups or samples being compared. Violations of this assumption can impact the validity of statistical tests.

We can calculate the variance or standard deviation of the data points within each group. If the variances are similar or within an
acceptable range, this suggests homogeneity. Graphical methods such as histograms or box plots can also be used to visualise the
distribution of data within each group.
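
A quick numerical check of this assumption might look like the sketch below (the two groups are invented sample data):

import numpy as np

group_a = np.array([4.9, 5.1, 5.0, 4.8, 5.2])
group_b = np.array([5.0, 5.3, 4.7, 5.1, 4.9])

print("Variance A:", group_a.var(ddof=1))  # sample variance of group A
print("Variance B:", group_b.var(ddof=1))  # sample variance of group B
# Similar values suggest homogeneity of variance; box plots of both groups show the same visually.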

8.2 What is “Entropy”, and what is its function?

Entropy is used as a measure of impurity or disorder within a dataset. It helps decide how to split data at decision nodes to create more
homogeneous subsets.
The function of entropy is to guide the construction of decision trees by identifying features that result in the greatest reduction of
uncertainty (entropy) in the child nodes after a split. It helps select the most informative features for classification or regression tasks.

Entropy(S)

The more uncertain or unpredictable the outcomes, the higher the entropy.

$$H(S) = -\sum_{i} p_i \log_2(p_i)$$

where the sum runs over the classes and $p_i$ is the proportion of elements belonging to class $i$.


$$\text{Entropy}(S) = -\frac{p}{p+n}\log_2\!\left(\frac{p}{p+n}\right) - \frac{n}{p+n}\log_2\!\left(\frac{n}{p+n}\right)$$

Entropy(S) represents the entropy of the dataset S.


p is the number of positive instances (e.g., instances belonging to a particular class) in the dataset S.
n is the number of negative instances (e.g., instances not belonging to that class) in the dataset S.
$\log_2(x)$ represents the logarithm of x with base 2.

Dataset S :

Contains elements of two classes, positive (p) and negative (n).

First Term :

$-\frac{p}{p+n}\log_2\!\left(\frac{p}{p+n}\right)$

$\frac{p}{p+n}$ is the probability of an element being in the positive class.
$\log_2\!\left(\frac{p}{p+n}\right)$ calculates the information (in bits) for this class.
The whole term represents the contribution of the positive class to the entropy of the dataset.

Second Term :

$-\frac{n}{p+n}\log_2\!\left(\frac{n}{p+n}\right)$

$\frac{n}{p+n}$ is the probability of an element being in the negative class.
$\log_2\!\left(\frac{n}{p+n}\right)$ calculates the information for this class.
This term represents the contribution of the negative class to the entropy.

Overall Entropy:
The sum of these two terms gives the total entropy of the dataset, measuring its impurity or disorder.

Example :

30 positive cases (p = 30)


10 negative cases (n = 10)

In [2]: #positive class and negative class


positive = 30
negative = 10

In [3]: #probabilities
p_prob = positive / (positive + negative)
n_prob = negative / (positive + negative)

In [4]: #entropy
entropy = (-p_prob * math.log2(p_prob)) - (n_prob * math.log2(n_prob))
print("Entropy:", entropy)

Entropy: 0.8112781244591328

$$\text{Entropy}(S) = -\frac{p}{p+n}\log_2\!\left(\frac{p}{p+n}\right) - \frac{n}{p+n}\log_2\!\left(\frac{n}{p+n}\right)$$

1. Calculate Probabilities:

Probability of positive case:

$\frac{p}{p+n} = \frac{30}{30+10} = \frac{30}{40} = 0.75$

Probability of negative case:

$\frac{n}{p+n} = \frac{10}{30+10} = \frac{10}{40} = 0.25$

2. Apply Entropy Formula:

$\text{Entropy}(S) = -\frac{30}{40}\log_2\!\left(\frac{30}{40}\right) - \frac{10}{40}\log_2\!\left(\frac{10}{40}\right)$

Simplify: $\text{Entropy}(S) = -0.75\log_2(0.75) - 0.25\log_2(0.25)$

3. Calculate Logarithmic Values:

$\log_2(0.75) \approx -0.415$

$\log_2(0.25) = -2$

4. Substitute and Calculate:


$\text{Entropy}(S) = (-0.75 \times -0.415) - (0.25 \times -2)$

Simplify: $\text{Entropy}(S) \approx 0.31125 + 0.5$

$\text{Entropy}(S) \approx 0.81125$

Example 2

Our dataset consists of three attributes: A1 represents the type of destination (labelled here with the colour codes Red and Blue), A2 represents
the average temperature of the destination (Hot or Cold), and Class indicates whether the customer enjoyed their vacation, with values Yes
(enjoyed) or No (did not enjoy).

The data frame therefore has three columns, A1, A2, and Class, each holding categorical values.

A1: categorical labels with possible values Red and Blue, standing for the two types of vacation destination.
A2: temperature categories with possible values Hot and Cold, describing the climate of the vacation destination.
Class: the target variable, a binary outcome with possible values Yes and No, indicating whether the customer enjoyed their
vacation experience.

In [5]: dataset = {
"A1": ["Red", "Red", "Blue", "Blue", "Red"],
"A2": ["Hot", "Cold", "Hot", "Cold", "Hot"],
"Class": ["Yes", "No", "Yes", "Yes", "No"]
}

In [6]: df = pd.DataFrame(dataset)

In [7]: #entropy of a binary set


def entropy_binary(p, n):
    #handle the case where there is no impurity
    if p == 0 or n == 0:
        return 0
    total = p + n
    return -(p/total) * math.log2(p/total) - (n/total) * math.log2(n/total)

In [8]: def entropy(label, value):
    #helper that prints a labelled entropy value to four decimal places
    print(f'Entropy (S) of {label} : {value:.4f}')

In [9]: print(df)

A1 A2 Class
0 Red Hot Yes
1 Red Cold No
2 Blue Hot Yes
3 Blue Cold Yes
4 Red Hot No

Class Entropy

We count how many times each outcome appears in the Class category of our dataset.
We find that Yes appears 3 times and No appears 2 times.
We then add these counts together to get the total number of entries in the Class category, which in this case is 5 (3 Yes and 2 No).
The concept of entropy is a way to measure how mixed or uncertain the Class category is.
High entropy means that the outcomes are very mixed (like having an equal number of Yes and No), indicating more unpredictability.
Low entropy means that the outcomes are not very mixed (like having mostly Yes and very few No), indicating less unpredictability.

In [10]: #class counts


total_yes = 3
total_no = 2
total_class_count = total_yes + total_no

In [11]: #entropy of the dataset


entropy_s = entropy_binary(total_yes, total_no)
entropy("Class", entropy_s)

Entropy (S) of Class : 0.9710

An entropy value of 0.9710 is relatively high, suggesting a significant level of mixture or diversity in the outcomes. It implies that the Class
category contains a fairly balanced mix of Yes and No outcomes, but not perfectly balanced. If it were perfectly balanced, the entropy would
be exactly 1.

A1 Red Entropy

In [12]: #entropy for A1


#Red: 1 Yes, 2 No
a1_red_yes = 1
a1_red_no = 2


entropy_red = entropy_binary(a1_red_yes, a1_red_no)


entropy("A1 RED", entropy_red)

Entropy (S) of A1 RED : 0.9183

We count how many times Yes and No occur when A1 is Red: one Yes and two No.


An entropy of 0.9183 is quite high on a scale from 0 to 1, where 0 represents no uncertainty (all outcomes are the same) and 1 represents
maximum uncertainty (a perfect split between outcomes).

A1 Blue Entropy

In [13]: #Blue: 2 Yes, 0 No


a1_blue_yes = 2
a1_blue_no = 0
entropy_blue = entropy_binary(a1_blue_yes, a1_blue_no)
entropy("A1 BLUE", entropy_blue)

Entropy (S) of A1 BLUE : 0.0000

Since every Blue instance has the outcome Yes (2 Yes, 0 No), the subset is perfectly pure, so the entropy is at its minimum.
An entropy of 0.0 for A1 BLUE indicates that the outcome is completely predictable whenever A1 is Blue.

Average Information Entropy for A1

$$I(\text{Attribute}) = \sum_i \frac{p_i + n_i}{p + n}\,\text{Entropy}(A_i)$$

In [14]: #average information entropy for A1


#weight each branch's entropy by the fraction of rows that fall in it
red_count = a1_red_yes + a1_red_no
blue_count = a1_blue_yes + a1_blue_no
i_a1 = (red_count/total_class_count) * entropy_red + (blue_count/total_class_count) * entropy_blue

print(i_a1)

0.5509775004326937

This weighted average tells us how much uncertainty about Class remains after splitting on A1.
A value of roughly 0.55, well below the dataset entropy of 0.9710, shows that splitting on A1 removes a substantial part of that
uncertainty.

A2 Hot Entropy

In [15]: #entropy for A2


# A2 = Hot: 2 Yes, 1 No
hot_yes = 2
hot_no = 1

entropy_hot = entropy_binary(hot_yes, hot_no)


entropy("A2 HOT", entropy_hot)

Entropy (S) of A2 HOT : 0.9183

A2 Cold Entropy

In [16]: # A2 = Cold: 1 Yes, 1 No


cold_yes = 1
cold_no = 1
entropy_cold = entropy_binary(cold_yes , cold_no)
entropy("A2 COLD", entropy_cold)

Entropy (S) of A2 COLD : 1.0000

The entropy for the category Hot in A2 is 0.9183, which is relatively high, while the entropy for the category Cold in A2 is 1.0, the maximum
possible for a binary outcome.

The high entropy for Hot means that when A2 is Hot there is still considerable uncertainty about whether the class will be Yes or
No: Hot is associated with Yes twice and No once, so it does not strongly lean towards a single outcome.

The maximum entropy for Cold means that knowing A2 is Cold tells us nothing at all about the class: Cold appears once with Yes and once
with No, a perfect split.

Average Information Entropy for A2


$$I(\text{Attribute}) = \sum_i \frac{p_i + n_i}{p + n}\,\text{Entropy}(A_i)$$

In [17]: #average information entropy for A2


hot_count = hot_yes + hot_no
cold_count = cold_yes + cold_no
i_a2 = (hot_count/total_class_count) * entropy_hot + (cold_count/total_class_count) * entropy_cold

print(i_a2)

0.9509775004326937

Information Gain for each Attribute

Difference in Entropy before and after splitting dataset on attribute A

Gain = Entropy(S) − I (Attribute)

In [18]: #information gain for each attribute


gain_a1 = entropy_s - i_a1
gain_a2 = entropy_s - i_a2

In [19]: entropy_s, gain_a1, gain_a2

Out[19]: (0.9709505944546686, 0.4199730940219749, 0.01997309402197489)

gain_a1 is 0.4199730940219749, indicating how much considering the destination type (A1) reduces the confusion about customer satisfaction.
gain_a2 is 0.01997309402197489, indicating how much considering the temperature (A2) reduces the confusion about customer satisfaction.
Because A1 yields the larger gain, ID3 would split on A1 first, which agrees with the pandas computation and the scikit-learn feature importances shown later in this notebook.
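
As a small follow-up (not an original cell), the gains computed above could be used directly to pick the root attribute, mirroring what ID3 does:

gains = {"A1": gain_a1, "A2": gain_a2}
best_attribute = max(gains, key=gains.get)
print("Best attribute to split on:", best_attribute)  # A1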

Decision Tree Library


In [20]: #loading machine learning library
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split

In [21]: #creating model


clf = DecisionTreeClassifier()

In [22]: #printing data frame


df

Out[22]: A1 A2 Class

0 Red Hot Yes

1 Red Cold No

2 Blue Hot Yes

3 Blue Cold Yes

4 Red Hot No

In [23]: #encode categorical variables


df_encoded = pd.get_dummies(df, columns=["A1", "A2"])

In [24]: df_encoded

Out[24]: Class A1_Blue A1_Red A2_Cold A2_Hot

0 Yes False True False True

1 No False True True False

2 Yes True False False True

3 Yes True False True False

4 No False True False True

Feature Selection

In [25]: X = df_encoded.drop('Class', axis=1)


X


Out[25]: A1_Blue A1_Red A2_Cold A2_Hot

0 False True False True

1 False True True False

2 True False False True

3 True False True False

4 False True False True

In [26]: y = df_encoded['Class'].values
y

Out[26]: array(['Yes', 'No', 'Yes', 'Yes', 'No'], dtype=object)

Splitting into Train and Test Set

In [27]: #80 percent for training and 20 percent for test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [28]: X_train

Out[28]: A1_Blue A1_Red A2_Cold A2_Hot

0 False True False True

1 False True True False

3 True False True False

4 False True False True

In [29]: X_test

Out[29]: A1_Blue A1_Red A2_Cold A2_Hot

2 True False False True

Training Model

In [30]: clf.fit(X_train, y_train)

Out[30]: DecisionTreeClassifier()

Accuracy of Model

In [31]: print("Training Accuracy:", clf.score(X_train, y_train))


print("Testing Accuracy:", clf.score(X_test, y_test))

Training Accuracy: 0.75


Testing Accuracy: 1.0

In [32]: print("Classes:", clf.classes_)


print("Feature Importances:", clf.feature_importances_)
print("Feature Names:", clf.feature_names_in_)
print("Number of Classes:", clf.n_classes_)
print("Number of Outputs:", clf.n_outputs_)
print("Number of Leaves:", clf.get_n_leaves())

Classes: ['No' 'Yes']


Feature Importances: [0.66666667 0. 0. 0.33333333]
Feature Names: ['A1_Blue' 'A1_Red' 'A2_Cold' 'A2_Hot']
Number of Classes: 2
Number of Outputs: 1
Number of Leaves: 3

In [33]: n_nodes = clf.tree_.node_count


children_left = clf.tree_.children_left
children_right = clf.tree_.children_right
impurity = clf.tree_.impurity

print("Node count:", n_nodes)

for i in range(n_nodes):
    if children_left[i] != children_right[i]:
        print(f"Node {i} gini impurity: {impurity[i]}")


Node count: 5
Node 0 gini impurity: 0.5
Node 1 gini impurity: 0.4444444444444444

Decision Tree Diagram

In [34]: #plotting decision tree diagram


plt.figure(figsize=(6, 6))
plot_tree(clf, filled=True, feature_names=X.columns, class_names=["not enjoyed", "enjoyed"], rounded=True, fontsize=8)
plt.show()
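
A plain-text view of the same fitted tree can also be printed; this is an optional extra, not an original cell:

from sklearn.tree import export_text
print(export_text(clf, feature_names=list(X.columns)))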

In [35]: df

Out[35]: A1 A2 Class

0 Red Hot Yes

1 Red Cold No

2 Blue Hot Yes

3 Blue Cold Yes

4 Red Hot No

In [36]: def entropy(y):
    if len(y) == 0:
        return 0
    p = sum(y == "Yes") / len(y)
    n = sum(y == "No") / len(y)
    if p == 0 or n == 0:
        return 0
    return -(p * math.log2(p) + n * math.log2(n))

In [37]: #entropy of the dataset


overall_entropy = entropy(df['Class'])

In [38]: print("Overall Entropy of dataset:", overall_entropy)

Overall Entropy of dataset: 0.9709505944546686

In [39]: #entropy for each attribute


entropies = {}
for i in ["A1", "A2"]:
    unique_values = df[i].unique()
    entropy_sum = 0
    for value in unique_values:
        subset = df[df[i] == value]['Class']
        entropy_sum += (len(subset) / len(df)) * entropy(subset)
    entropies[i] = entropy_sum

In [40]: print(entropies)

{'A1': 0.5509775004326937, 'A2': 0.9509775004326937}

In [41]: print("Entropy for each attribute A1:", entropies['A1'])

Entropy for each attribute A1: 0.5509775004326937


In [42]: print("Entropy for each attribute A2:", entropies['A2'])

Entropy for each attribute A2: 0.9509775004326937


