Gini Impurity
A measurement used to build Decision Trees to determine how the
features of a dataset should split nodes to form the tree.
https://www.learndatasci.com/glossary/gini-impurity/ 1/9
20.10.22, 20:25 Gini Impurity – LearnDataSci
For example, say you want to build a classifier that determines if someone
will default on their credit card. You have some labeled data with features,
such as bins for age, income, credit rating, and whether or not each person
is a student. To find the best feature for the first split of the tree – the root
node – you could calculate how poorly each feature divided the data into
the correct class, default ("yes") or didn't default ("no"). This calculation
would measure the impurity of the split, and the feature with the lowest
impurity would determine the best feature for splitting the current node.
This process would continue for each subsequent node using the
remaining features.
Mathematical definition
The node with uniform class distribution has the highest impurity. The
minimum impurity is obtained when all records belong to the same class.
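For reference, the standard definition behind these two statements can be written as below. The notation is an assumption made here rather than something taken from the text above: D is the set of records at a node, k is the number of classes, and p_i is the fraction of records in D that belong to class i.

```latex
\[
\mathrm{Gini}(D) = 1 - \sum_{i=1}^{k} p_i^{2}
\]
```

With a uniform distribution, every p_i equals 1/k and the expression evaluates to 1 - 1/k (0.5 for two classes). When all records belong to one class, a single p_i equals 1 and the expression evaluates to 0.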
Several examples are given in the following table to demonstrate the Gini
Impurity computation.
Node A: class counts 0 and 10; class probabilities 0 and 1; Gini impurity 1 - 0^2 - 1^2 = 0
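The computation for a single node can be sketched in a few lines of Python. This is only an illustration: the function name gini_from_counts and the evenly mixed node are assumptions added here, while the counts 0 and 10 are the ones listed for Node A.

```python
def gini_from_counts(counts):
    """Gini impurity of a node, given the number of records in each class."""
    n = sum(counts)
    if n == 0:
        raise ValueError("node has no records")
    # 1 minus the sum of squared class probabilities
    return 1.0 - sum((c / n) ** 2 for c in counts)


node_a = gini_from_counts([0, 10])     # Node A: every record in one class
node_mixed = gini_from_counts([5, 5])  # hypothetical evenly mixed node
print(node_a, node_mixed)
```

The pure node scores 0 and the evenly mixed two-class node scores 0.5, matching the minimum and maximum described above.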
When training a decision tree, the attribute whose split gives the smallest
weighted Gini Impurity is chosen to split the node.
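One common way to score a split is to weight the impurity of each resulting subset by its share of the records and add the results. The sketch below assumes that weighting; the function names, the attribute labels "x" and "y", and the class counts are all hypothetical and chosen only to make the comparison easy to follow.

```python
def gini(counts):
    """Gini impurity of a non-empty node, given per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def weighted_gini(subsets):
    """Impurity of a split; subsets holds one per-class count list per child."""
    total = sum(sum(counts) for counts in subsets)
    # Each child's impurity is weighted by its fraction of all records
    return sum(sum(counts) / total * gini(counts) for counts in subsets)


# Hypothetical splits of 10 records (6 of one class, 4 of the other)
split_x = [[6, 0], [0, 4]]  # separates the classes completely
split_y = [[3, 2], [3, 2]]  # leaves both children mixed

scores = {"x": weighted_gini(split_x), "y": weighted_gini(split_y)}
best = min(scores, key=scores.get)  # attribute with the smallest score
print(scores, best)
```

Under these assumed counts, the split on "x" has the lower weighted impurity and would be the one selected.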
Python Example
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
plt.figure()
x = np.linspace(0.01,1)
y = 1 - (x*x) - (1-x)*(1-x)
plt.plot(x,y)
plt.title('Gini Impurity')
plt.ylabel("Impurity Measure")
plt.xticks(np.arange(0,1.1,0.1))
plt.show()
RESULT:
This figure shows that Gini impurity is maximum for the 50-50 sample
(class probability 0.5) and minimum for the homogeneous sample (class
probability 0 or 1).
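The same observation can be checked numerically without a plot. The sketch below evaluates the two-class expression from the plotting code on a grid of probabilities; the function name binary_gini and the grid spacing of 0.01 are assumptions made for this illustration.

```python
def binary_gini(p):
    """Gini impurity of a two-class node where one class has probability p."""
    return 1 - p * p - (1 - p) * (1 - p)


grid = [i / 100 for i in range(101)]  # p = 0.00, 0.01, ..., 1.00
p_max = max(grid, key=binary_gini)    # probability with the highest impurity

print(p_max, binary_gini(p_max))           # location and height of the peak
print(binary_gini(0.0), binary_gini(1.0))  # impurity of the two pure cases
```

On this grid the peak sits at p = 0.5 with an impurity of 0.5, and both pure cases give 0.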
Computation of Gini Impurity for a simple dataset
This data set is used to predict whether a person will default on their credit
card. There are two classes ( default = 'yes', no_default = 'no' ):
class_name = 'default'
data1 = {  # ... (the labeled records are elided here)
}
df1 = pd.DataFrame(data1)
print(df1)
OUT:
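def gini_impurity(value_counts):
    # value_counts: pandas Series holding the record count of each class
    n = value_counts.sum()
    p_sum = 0
    for key in value_counts.keys():
        p_sum += (value_counts[key] / n) ** 2  # squared class probability
    gini = 1 - p_sum
    return gini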
class_value_counts = df1[class_name].value_counts()
gini_class = gini_impurity(class_value_counts)
OUT:
yes 9
no 5
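As a check on the arithmetic, the class counts shown above (9 "yes" and 5 "no") can be plugged into the definition directly. This is a standalone sketch that uses exact fractions to avoid rounding noise; it does not depend on the DataFrame from the example.

```python
from fractions import Fraction

counts = {"yes": 9, "no": 5}  # class counts from the output above
n = sum(counts.values())      # 14 records in total

# 1 - (9/14)^2 - (5/14)^2, kept as an exact fraction
gini_exact = 1 - sum(Fraction(c, n) ** 2 for c in counts.values())
print(gini_exact, round(float(gini_exact), 3))
```

The result is 45/98, or about 0.459, which is the impurity of the class column before any split.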
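# STEP 2:
# Calculating Gini impurity for the attributes
def gini_split_a(attribute_name):
    attribute_values = df1[attribute_name].value_counts()
    gini_A = 0
    for key in attribute_values.keys():
        # Class counts among the rows where the attribute takes this value
        df_k = df1[class_name][df1[attribute_name] == key].value_counts()
        n_k = attribute_values[key]
        n = df1.shape[0]
        # Weight the subset's impurity by its share of the rows
        gini_A = gini_A + (n_k / n) * gini_impurity(df_k)
    return gini_A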
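gini_attiribute = {}
# Assumption: every column other than the class column is a candidate attribute
for key in df1.columns.drop(class_name):
    gini_attiribute[key] = gini_split_a(key)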
OUT:
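# STEP 3:
# Pick the attribute with the smallest Gini impurity
min_value = min(gini_attiribute.values())
selected_attribute = min(gini_attiribute, key=gini_attiribute.get)
print('The selected attribute is:', selected_attribute)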
OUT:
The selected attribute is: age
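Step 3 needs the attribute whose impurity is smallest, which is not necessarily the attribute whose name sorts first. The sketch below shows the difference on a small dictionary. The scores are hypothetical values invented for this illustration and are not the results for the credit card data set.

```python
# Hypothetical impurity scores, chosen only to illustrate the selection step
scores = {"age": 0.40, "income": 0.45, "student": 0.30}

by_name = min(scores.keys())            # alphabetical minimum, ignores the scores
by_score = min(scores, key=scores.get)  # key whose value is the smallest
print(by_name, by_score)
```

With these assumed scores the two calls disagree: the first returns "age" because of alphabetical order, while the second returns "student", the entry with the lowest impurity.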
Fatih Karabiber
Ph.D. in Computer Engineering, Data Scientist
Associate Professor of Computer Engineering. Author/co-author of over 30
journal publications. Instructor of graduate/undergraduate courses.
Supervisor of Graduate thesis. Consultant to IT Companies.