
DR. N.G.P.

INSTITUTE OF TECHNOLOGY
(AN AUTONOMOUS INSTITUTION)
Kalapatti road, Coimbatore-641 048

DATA SCIENCE

MISSING DATA / VALUE

• Missing data refers to values that are not stored (or not present) for
some variable(s) in the given dataset.

• The Titanic dataset is a common example of a dataset containing missing values.


REPRESENTATION

• In the dataset, blank cells indicate missing values.


• In Pandas, missing values are usually represented by NaN, which stands for Not a Number.
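As a quick illustration, a toy DataFrame (the column names and values below are made up for this sketch) shows how pandas treats both np.nan and Python's None as missing:

```python
import numpy as np
import pandas as pd

# made-up mini-DataFrame; blanks in a real file become NaN on loading
df = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0],     # second value is missing
    'Cabin': ['C85', None, np.nan],  # None is also treated as missing
})

print(df)                         # NaN / None shown where data is absent
print(df['Age'].isna().tolist())  # [False, True, False]
```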
TYPES OF MISSING DATA
1. Missing Completely At Random (MCAR)
The probability of a value being missing is the same for all
observations. In this case, missing values are completely independent of
the other data. There is no pattern.

2. Missing At Random (MAR)


The reason for the missing values can be explained by variables on
which you have complete information, since there is some relationship between
the missingness and the other observed data. In this case, the data is not
missing for all observations, and there is some pattern in the missing values.

3. Missing Not At Random (MNAR)


Missing values depend on the unobserved data itself. If there is some
structure in the missing data that the other observed data cannot explain, it
is considered MNAR. Any missing data that does not fall under MCAR
or MAR is MNAR. It can happen due to the reluctance of people to
provide the required information.
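The three mechanisms can be sketched on synthetic data. Everything below — the column names, probabilities, and thresholds — is invented purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'age': rng.integers(18, 80, 1000).astype(float),
                   'income': rng.normal(50000, 15000, 1000)})

# MCAR: every income value has the same 10% chance of being missing
mcar = df['income'].mask(rng.random(1000) < 0.10)

# MAR: income is more likely to be missing for younger people
# (missingness depends on the fully observed 'age' column)
mar = df['income'].mask(
    rng.random(1000) < np.where(df['age'] < 30, 0.30, 0.05))

# MNAR: high incomes are more likely to be withheld
# (missingness depends on the unobserved value itself)
mnar = df['income'].mask(
    rng.random(1000) < np.where(df['income'] > 60000, 0.30, 0.05))

print(mcar.isna().sum(), mar.isna().sum(), mnar.isna().sum())
```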
IDENTIFYING
• Identifying missing values in Python is straightforward. First, the
appropriate libraries should be imported and the dataset should be read.

import pandas as pd
df = pd.read_excel(r'...dataset.xlsx')
• The first method of missing data identification is:

df.isna().any()

• which returns a boolean output for each column. "True" - the column
contains missing values. "False" - the column contains no missing
values.
SECOND METHOD
• The second method of missing data identification is:

df.isna().sum()

• which returns the number of missing values in each column.
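A minimal sketch of both identification methods, using a made-up two-column DataFrame:

```python
import numpy as np
import pandas as pd

# toy data; the column names are illustrative only
df = pd.DataFrame({'Age': [22.0, np.nan, 38.0, np.nan],
                   'Fare': [7.25, 71.28, 8.05, 53.10]})

print(df.isna().any())  # Age: True, Fare: False
print(df.isna().sum())  # Age: 2,    Fare: 0
```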


IMPORTANT METHODS IN MISSING DATA
• isnull()
Returns a dataframe of boolean values that are True for NaN
values when checking for null values in a Pandas DataFrame.
• notnull()
Returns a dataframe of boolean values that are False for NaN
values when checking for null values in a Pandas DataFrame.
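A short sketch showing that isnull() and notnull() are element-wise complements of each other (the toy Series is made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.isnull().tolist())   # [False, True, False]
print(s.notnull().tolist())  # [True, False, True]
```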
HANDLING
• There are two main approaches to dealing with missing data, namely:
the easy way and the professional way.

Easy way:
▪ Ignore tuples with missing values: appropriate when the given
dataset is large and a tuple is missing several values.
▪ Drop missing values: only appropriate when the dataset is large.
EASY WAY (CONTD..)
• DELETION
1. Listwise: referred to as "complete case analysis"; every case (row) with
any missing value is deleted. For a small dataset, it can create bias and
mislead the results.

2. Pairwise: deletion occurs only where missing data exists. The
subsets with complete cases should be considered, because this preserves more
information.

3. Entire variables: if one column contains, for example, 60% missing values,
this column can be deleted entirely.

4. Dropping: the process of deleting a whole row of data.
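The deletion strategies above can be sketched on a made-up DataFrame (the column names and the 60% threshold are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0, 5.0],
                   'b': [np.nan, np.nan, np.nan, 4.0, np.nan],  # 80% missing
                   'c': [1.0, 2.0, np.nan, 4.0, 5.0]})

# Listwise deletion: drop every row with any missing value
listwise = df.dropna()

# Pairwise: each statistic uses all rows available for that column,
# so more information is preserved than with listwise deletion
pairwise_means = df.mean()  # skips NaN per column

# Entire variable: drop columns missing more than a threshold (e.g. 60%)
thresh = df.loc[:, df.isna().mean() <= 0.6]

print(len(listwise), list(thresh.columns))
```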


PROFESSIONAL WAY

• Although dropping is one method of handling missing data, it has an
important disadvantage: because of a single missing value, an entire row
of otherwise valuable data is deleted.

• That is why imputing the missing values is a better choice than
dropping them.
PROFESSIONAL WAY(CONTD…)
• IMPUTATION
Imputation is the best strategy for handling missing values. Different
methods for imputation have been proposed, ranging from simple to
complex. In Pandas, fillna() is used for imputation.

1. Mean/Median/Mode:
-> One method of imputation uses the mean or median.
-> The mean or median of the particular column is calculated and
filled in place of the missing data.
-> For categorical data, however, the mode is used.
EXAMPLE 1: REMOVING ROWS WITH MISSING VALUES

# example of removing rows that contain missing values
from numpy import nan
from pandas import read_csv
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# summarize the shape of the raw data
print(dataset.shape)
# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)
# drop rows with missing values
dataset.dropna(inplace=True)
# summarize the shape of the data with missing rows removed
print(dataset.shape)

OUTPUT:
(768, 9)
(392, 9)

EXAMPLE 2: IMPUTING MISSING VALUES WITH THE COLUMN MEAN

# manually impute missing values with numpy
from pandas import read_csv
from numpy import nan
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)
# fill missing values with mean column values
dataset.fillna(dataset.mean(), inplace=True)
# count the number of NaN values in each column
print(dataset.isnull().sum())

OUTPUT:
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
dtype: int64
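For categorical columns, the same fillna() idea works with the mode instead of the mean. A short sketch (the column name and values are made up, in the style of the Titanic Embarked column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', np.nan, 'Q']})

# fill missing categories with the most frequent value (the mode)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

print(df['Embarked'].tolist())  # ['S', 'C', 'S', 'S', 'S', 'Q']
```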
ADVANTAGES
• It is quick and easy.
• When the mean is imputed, the mean of the whole column does not
change.
• Using the mean makes sense, because it is a reasonable estimate for a
randomly selected observation.

DISADVANTAGES
• It can give poor results for categorical features.
• It reduces the variance and distorts the covariance among the
remaining variables.
THANK YOU!!
