
20.10.22, 20:25 Gini Impurity – LearnDataSci


Author: Fatih Karabiber

Ph.D. in Computer Engineering, Data Scientist

Gini Impurity
A measurement used to build Decision Trees to determine how the
features of a dataset should split nodes to form the tree.


What is Gini Impurity?


Gini Impurity is a measurement used to build Decision Trees to determine
how the features of a dataset should split nodes to form the tree. More
precisely, the Gini Impurity of a dataset is a number between 0 and 0.5 for
a two-class problem (in general, between 0 and $1 - 1/k$ for $k$ classes),
which indicates the likelihood of new, random data being misclassified if it
were given a random class label according to the class distribution in the
dataset.
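The "likelihood of being misclassified" reading can be checked with a small simulation. The sketch below is illustrative only: the function name `estimate_misclassification`, the class distribution `[0.3, 0.7]`, the trial count, and the seed are all assumptions made for the example. It draws a true class and a random label from the same distribution and counts how often they disagree; for this distribution the exact disagreement probability is 0.3 × 0.7 + 0.7 × 0.3 = 0.42.

```python
import random


def estimate_misclassification(class_probs, trials=100_000, seed=0):
    """Estimate how often a random label disagrees with the true class.

    Both the true class and the assigned label are drawn independently
    from the same class distribution (hypothetical helper for illustration).
    """
    rng = random.Random(seed)
    classes = list(range(len(class_probs)))
    wrong = 0
    for _ in range(trials):
        true_class = rng.choices(classes, weights=class_probs)[0]
        guessed_class = rng.choices(classes, weights=class_probs)[0]
        if guessed_class != true_class:
            wrong += 1
    return wrong / trials


# Assumed two-class distribution: 30% of samples in one class, 70% in the other
estimate = estimate_misclassification([0.3, 0.7])
print(f"Estimated misclassification rate: {estimate:.3f}")
```

The estimate should land close to 0.42, with small run-to-run differences if the seed is changed.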

https://www.learndatasci.com/glossary/gini-impurity/ 1/9

For example, say you want to build a classifier that determines if someone
will default on their credit card. You have some labeled data with features,
such as bins for age, income, credit rating, and whether or not each person
is a student. To find the best feature for the first split of the tree – the root
node – you could calculate how poorly each feature divided the data into
the correct class, default ("yes") or didn't default ("no"). This calculation
would measure the impurity of the split, and the feature with the lowest
impurity would be the best choice for splitting the current node.
This process would continue for each subsequent node using the
remaining features.

In the image above, age has the minimum Gini Impurity, so age is selected as
the root in the decision tree.

Mathematical definition

Consider a dataset $D$ that contains samples from $k$ classes. The probability
of samples belonging to class $i$ at a given node can be denoted as $p_i$.
Then the Gini Impurity of $D$ is defined as:

$$Gini(D) = 1 - \sum_{i=1}^{k} p_i^2$$
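The formula $Gini(D) = 1 - \sum_{i=1}^{k} p_i^2$ can be connected to the misclassification reading given earlier. A short derivation, assuming that a sample's true class and its assigned label are drawn independently from the same distribution $p_1, \dots, p_k$: a sample of class $i$ (probability $p_i$) receives a wrong label with probability $1 - p_i$, so

```latex
\begin{aligned}
P(\text{misclassified})
  &= \sum_{i=1}^{k} p_i \,(1 - p_i) \\
  &= \sum_{i=1}^{k} p_i \;-\; \sum_{i=1}^{k} p_i^2 \\
  &= 1 - \sum_{i=1}^{k} p_i^2 ,
\end{aligned}
```

where the last step uses the fact that the class probabilities sum to 1.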

The node with uniform class distribution has the highest impurity. The
minimum impurity is obtained when all records belong to the same class.
Several examples are given in the following table to demonstrate the Gini
Impurity computation.

          Count       Probability    Gini Impurity
          n1    n2    p1     p2      1 - p1^2 - p2^2

Node A    0     10    0      1       1 - 0^2 - 1^2 = 0

Node B    3     7     0.3    0.7     1 - 0.3^2 - 0.7^2 = 0.42

Node C    5     5     0.5    0.5     1 - 0.5^2 - 0.5^2 = 0.5
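The three rows of the table can be recomputed from their class counts. This is a minimal sketch; the helper name `gini_from_counts` is an assumption made for the example, and it expects a node with at least one sample.

```python
def gini_from_counts(counts):
    """Gini Impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    return 1 - sum((count / total) ** 2 for count in counts)


# Class counts taken from the table: (samples in class 1, samples in class 2)
nodes = {"Node A": (0, 10), "Node B": (3, 7), "Node C": (5, 5)}

table_gini = {name: gini_from_counts(counts) for name, counts in nodes.items()}
for name, value in table_gini.items():
    print(f"{name}: {value:.2f}")
```

Node A (a pure node) comes out at 0, Node B at 0.42, and Node C (a 50-50 node) at 0.5, the two-class maximum.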

An attribute with the smallest Gini Impurity is selected for splitting the
node.

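If a data set $D$ is split on an attribute $A$ into two subsets $D_1$ and $D_2$
with sizes $n_1$ and $n_2$, respectively (so that $n = n_1 + n_2$), the Gini
Impurity of the split can be defined as:

$$Gini_A(D) = \frac{n_1}{n} Gini(D_1) + \frac{n_2}{n} Gini(D_2)$$

When training a decision tree, the attribute that provides the smallest
$Gini_A(D)$ is chosen to split the node.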
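The weighted impurity of a split can be illustrated with the class counts of Node B (3, 7) and Node C (5, 5) from the table above, treated here as the two branches of a hypothetical split of a 20-sample node. The helper names `gini_from_counts` and `gini_split` are assumptions for this sketch, and every branch is assumed to be non-empty.

```python
def gini_from_counts(counts):
    """Gini Impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    return 1 - sum((count / total) ** 2 for count in counts)


def gini_split(subsets):
    """Size-weighted average of the Gini Impurity of each branch."""
    n = sum(sum(counts) for counts in subsets)
    return sum(sum(counts) / n * gini_from_counts(counts) for counts in subsets)


# Two branches with 10 samples each: class counts (3, 7) and (5, 5)
split_gini = gini_split([(3, 7), (5, 5)])
print(f"Weighted Gini Impurity of the split: {split_gini:.2f}")
```

Each branch holds half of the samples, so the result is 0.5 × 0.42 + 0.5 × 0.5 = 0.46. The same function accepts more than two branches, which is how the multi-valued attributes later in the example are handled.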

To obtain the information gain for an attribute, the weighted impurity of the
branches is subtracted from the original impurity. The best split can also be
chosen by maximizing the Gini gain, which is calculated as follows:

$$\Delta Gini(A) = Gini(D) - Gini_A(D)$$
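Because the original impurity is the same for every candidate attribute at a node, minimizing the weighted impurity and maximizing the Gini gain pick the same split. The sketch below illustrates this on two made-up candidate splits of a parent node with class counts (8, 12); the names `split_x`, `split_y`, and all helper functions are hypothetical.

```python
def gini_from_counts(counts):
    """Gini Impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    return 1 - sum((count / total) ** 2 for count in counts)


def gini_split(subsets):
    """Size-weighted average of the Gini Impurity of each branch."""
    n = sum(sum(counts) for counts in subsets)
    return sum(sum(counts) / n * gini_from_counts(counts) for counts in subsets)


def gini_gain(parent_counts, subsets):
    """Original impurity minus the weighted impurity of the branches."""
    return gini_from_counts(parent_counts) - gini_split(subsets)


parent = (8, 12)  # assumed class counts before splitting
candidate_splits = {
    "split_x": [(3, 7), (5, 5)],   # branches stay mixed
    "split_y": [(8, 2), (0, 10)],  # second branch is pure
}

weighted = {name: gini_split(s) for name, s in candidate_splits.items()}
gains = {name: gini_gain(parent, s) for name, s in candidate_splits.items()}

best_by_impurity = min(weighted, key=weighted.get)
best_by_gain = max(gains, key=gains.get)
print(best_by_impurity, best_by_gain)
```

Both criteria select `split_y` here: its weighted impurity is 0.16 against 0.46 for `split_x`, and its gain is 0.32 against 0.02.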

Python Example

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


Visualizing Gini Impurity range

For a two-class problem, the following code plots the Gini Impurity as a
function of the probability of the first class.

# Create a figure showing the Gini Impurity for a two-class problem
plt.figure()
x = np.linspace(0, 1, 101)       # fraction of the first class
y = 1 - (x*x) - (1-x)*(1-x)      # Gini Impurity for two classes
plt.plot(x, y)
plt.title('Gini Impurity')
plt.xlabel("Fraction of Class k ($p_k$)")
plt.ylabel("Impurity Measure")
plt.xticks(np.arange(0, 1.1, 0.1))
plt.show()

RESULT:

[Figure: Gini Impurity ("Impurity Measure") plotted against the fraction of class k ($p_k$)]

This figure shows that Gini impurity is maximum for the 50-50 sample
($p_k = 0.5$) and minimum for the homogeneous sample ($p_k = 0$ or $p_k = 1$).

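The location of the maximum and the minima can also be derived directly. A short derivation for the two-class case, writing $p$ for the fraction of the first class (this shorthand is an assumption of the sketch) so that the second class has fraction $1 - p$:

```latex
\begin{aligned}
G(p) &= 1 - p^2 - (1 - p)^2 = 2p\,(1 - p), \\
\frac{dG}{dp} &= 2 - 4p = 0 \;\Longrightarrow\; p = \tfrac{1}{2}, \\
G\!\left(\tfrac{1}{2}\right) &= \tfrac{1}{2}, \qquad G(0) = G(1) = 0 .
\end{aligned}
```

This is why the two-class Gini Impurity always lies between 0 and 0.5.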
Computation of Gini Impurity for a simple dataset

This data set is used to predict whether a person will default on their credit
card. There are two classes (default = 'yes', no_default = 'no'):

# Defining a simple dataset
attribute_names = ['age', 'income', 'student', 'credit_rate']
class_name = 'default'
data1 = {
    'age': ['youth', 'youth', 'middle_age', 'senior', 'senior', 'senior', 'middle_age',
            'youth', 'youth', 'senior', 'youth', 'middle_age', 'middle_age', 'senior'],
    'income': ['high', 'high', 'high', 'medium', 'low', 'low', 'low',
               'medium', 'low', 'medium', 'medium', 'medium', 'high', 'medium'],
    'student': ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes',
                'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no'],
    'credit_rate': ['fair', 'excellent', 'fair', 'fair', 'fair', 'excellent', 'excellent',
                    'fair', 'fair', 'fair', 'excellent', 'excellent', 'fair', 'excellent'],
    'default': ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
                'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no'],
}
df1 = pd.DataFrame(data1, columns=data1.keys())
print(df1)

OUT:

           age  income  student  credit_rate  default
0        youth    high       no         fair       no
1        youth    high       no    excellent       no
2   middle_age    high       no         fair      yes
3       senior  medium       no         fair      yes
4       senior     low      yes         fair      yes
5       senior     low      yes    excellent       no
6   middle_age     low      yes    excellent      yes
7        youth  medium       no         fair       no
8        youth     low      yes         fair      yes
9       senior  medium      yes         fair      yes
10       youth  medium      yes    excellent      yes
11  middle_age  medium       no    excellent      yes
12  middle_age    high      yes         fair      yes
13      senior  medium       no    excellent       no


# STEP 1: Calculate gini(D)
def gini_impurity(value_counts):
    # value_counts: pandas Series mapping each class label to its count
    n = value_counts.sum()
    p_sum = 0
    for key in value_counts.keys():
        # Add the squared probability of each class
        p_sum = p_sum + (value_counts[key] / n) * (value_counts[key] / n)
    gini = 1 - p_sum
    return gini

class_value_counts = df1[class_name].value_counts()
print(f'Number of samples in each class is:\n{class_value_counts}')

gini_class = gini_impurity(class_value_counts)
print(f'\nGini Impurity of the class is {gini_class:.3f}')

OUT:

Number of samples in each class is:

yes 9

no 5

Name: default, dtype: int64

Gini Impurity of the class is 0.459
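The value 0.459 can be reproduced by hand from the class counts printed above (9 'yes' and 5 'no' out of 14 samples). A short worked calculation:

```latex
Gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2
        = 1 - \frac{81}{196} - \frac{25}{196}
        = \frac{90}{196} \approx 0.459
```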


# STEP 2:
# Calculating Gini Impurity for the attributes
def gini_split_a(attribute_name):
    attribute_values = df1[attribute_name].value_counts()
    n = df1.shape[0]
    gini_A = 0
    for key in attribute_values.keys():
        # Class counts among the rows where the attribute equals `key`
        df_k = df1[class_name][df1[attribute_name] == key].value_counts()
        n_k = attribute_values[key]
        # Weight each branch's impurity by its share of the samples
        gini_A = gini_A + ((n_k / n) * gini_impurity(df_k))
    return gini_A

gini_attiribute = {}
for key in attribute_names:
    gini_attiribute[key] = gini_split_a(key)
    print(f'Gini for {key} is {gini_attiribute[key]:.3f}')

OUT:

Gini for age is 0.343

Gini for income is 0.440

Gini for student is 0.367

Gini for credit_rate is 0.429
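The value reported for age can be checked by hand. Counting the class labels per age group in the printed dataset gives 2 'yes' / 3 'no' for youth (5 samples), 4 'yes' / 0 'no' for middle_age (4 samples), and 3 'yes' / 2 'no' for senior (5 samples), so:

```latex
\begin{aligned}
Gini(D_{\text{youth}})  &= 1 - \left(\tfrac{2}{5}\right)^2 - \left(\tfrac{3}{5}\right)^2 = 0.48, \\
Gini(D_{\text{middle\_age}}) &= 1 - 1^2 - 0^2 = 0, \\
Gini(D_{\text{senior}}) &= 1 - \left(\tfrac{3}{5}\right)^2 - \left(\tfrac{2}{5}\right)^2 = 0.48, \\
Gini_{\text{age}}(D) &= \tfrac{5}{14}\cdot 0.48 + \tfrac{4}{14}\cdot 0 + \tfrac{5}{14}\cdot 0.48
                      = \tfrac{4.8}{14} \approx 0.343 .
\end{aligned}
```

The pure middle_age branch contributes nothing to the sum, which is what pulls age below the other attributes.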


# STEP 3:
# Compute Gini gain values to find the best split
# The attribute with the maximum Gini gain is selected for splitting.
min_value = min(gini_attiribute.values())
print('The minimum value of Gini Impurity : {0:.3f}'.format(min_value))
# Gini gain = Gini(D) - Gini_A(D), with Gini(D) = gini_class from STEP 1
print('The maximum value of Gini Gain     : {0:.3f}'.format(gini_class - min_value))
# Pick the attribute whose weighted Gini Impurity is smallest
selected_attribute = min(gini_attiribute, key=gini_attiribute.get)
print('The selected attribute is:', selected_attribute)

OUT:

The minimum value of Gini Impurity : 0.343
The maximum value of Gini Gain     : 0.116
The selected attribute is: age

The figure at the top of this page corresponds to this example.
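The introduction notes that the process continues for each subsequent node using the remaining features. The sketch below shows one such step under stated assumptions: it takes the youth branch produced by splitting on age and scores the three remaining attributes on that subset only. It uses only the standard library rather than pandas, and the helper names `gini` and `gini_split` are hypothetical; the rows are copied from the printed dataset.

```python
from collections import Counter

columns = ["age", "income", "student", "credit_rate", "default"]
rows = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_age", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_age", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_age", "medium", "no", "excellent", "yes"),
    ("middle_age", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
records = [dict(zip(columns, row)) for row in rows]


def gini(labels):
    """Gini Impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(subset, attribute, target="default"):
    """Weighted Gini Impurity after splitting `subset` on `attribute`."""
    groups = {}
    for record in subset:
        groups.setdefault(record[attribute], []).append(record[target])
    n = len(subset)
    return sum(len(labels) / n * gini(labels) for labels in groups.values())


# Keep only the rows that fall into the youth branch of the age split
youth = [record for record in records if record["age"] == "youth"]
remaining_attributes = ["income", "student", "credit_rate"]

youth_scores = {a: gini_split(youth, a) for a in remaining_attributes}
best_for_youth = min(youth_scores, key=youth_scores.get)

for attribute, score in youth_scores.items():
    print(f"Gini for {attribute} within youth is {score:.3f}")
print("Best attribute for the youth branch:", best_for_youth)
```

Within the youth branch, student separates the two classes completely (weighted impurity 0), so it would be chosen for that node. The middle_age branch is already pure and needs no further split.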


Meet the Authors

Fatih Karabiber
Ph.D. in Computer Engineering, Data Scientist
Associate Professor of Computer Engineering. Author/co-author of over 30
journal publications. Instructor of graduate/undergraduate courses.
Supervisor of graduate theses. Consultant to IT companies.
