INSTITUTE OF TECHNOLOGY
(AN AUTONOMOUS INSTITUTION)
Kalapatti road, Coimbatore-641 048
DATA SCIENCE
MISSING DATA
MISSING DATA / VALUE
• Missing data are values that are not stored (or not present) for one
or more variables in the given dataset.
df.isna().sum()
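As a minimal sketch of the call above, the snippet below counts missing entries per column on a small made-up table. The DataFrame and its column names are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Coimbatore", "Chennai", None, "Salem"],
    "income": [50000, 62000, np.nan, np.nan],
})

# Number of missing entries in each column.
missing_per_column = df.isna().sum()

# Fraction of missing entries in each column (mean of a boolean mask).
missing_fraction = df.isna().mean()

print(missing_per_column)
print(missing_fraction)
```

Looking at the fraction as well as the count can help when judging whether deletion would discard too much of the data.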
Easy way:
▪ Ignore tuples with missing values: This method is appropriate when
the given dataset is large and a tuple has several missing values.
▪ Drop missing values: Only appropriate when the dataset is large, so
that little information is lost.
• DELETION
1. Listwise: Also referred to as “complete case analysis”. Every observation
(row) that has a missing value in any variable is deleted. For a small dataset,
this can introduce bias and mislead the results.
2. Pairwise: Each analysis uses every case that is complete for the variables
it involves, so a case is left out only of the calculations that need its
missing value. This preserves more information than listwise deletion.
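The difference between the two deletion strategies can be sketched on a small made-up table. The data and column names below are hypothetical. The sketch relies on `DataFrame.dropna()` for listwise deletion and on the fact that `DataFrame.corr()` excludes missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in three different rows.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, 6.0, 8.0, 10.0],
    "z": [5.0, 3.0, 4.0, np.nan, 1.0],
})

# Listwise deletion: keep only rows with no missing value in any column.
listwise = df.dropna()
listwise_rows = len(listwise)

# Pairwise view: for each pair of columns, count the rows where both
# values are present.
present = df.notna().astype(int)
pairwise_counts = present.T @ present

# corr() drops missing values per pair of columns, so each coefficient
# uses all rows that are complete for that particular pair.
pairwise_corr = df.corr()

print(listwise_rows)
print(pairwise_counts)
print(pairwise_corr)
```

Here listwise deletion keeps 2 of the 5 rows, while each pair of columns still has 3 usable rows under pairwise deletion.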
• Although dropping is one way of handling missing data, it has an
important disadvantage: because of a single missing value, an entire row
is deleted, along with its other values, which could have been valuable
for the analysis.
1. Mean/Median/Mode:
-> One method of imputation is to use the mean or median.
-> The mean or median of the particular column is calculated and
then filled in place of the missing data.
-> For categorical data, the mode is used instead.
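The three options can be sketched together on a small made-up table. The data and column names are hypothetical, and choosing the median for the column with an outlier is an illustrative assumption, not a rule from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 80.0 in "bmi" acts as an outlier.
df = pd.DataFrame({
    "glucose": [100.0, np.nan, 120.0, 140.0],
    "bmi": [22.0, 30.0, np.nan, 80.0],
    "smoker": ["no", "yes", None, "no"],
})

# Numeric column: fill with the mean of the observed values.
df["glucose"] = df["glucose"].fillna(df["glucose"].mean())

# Numeric column with an outlier: the median is less affected by it.
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

# Categorical column: fill with the most frequent value (the mode).
# mode() returns a Series because ties are possible; take the first entry.
df["smoker"] = df["smoker"].fillna(df["smoker"].mode()[0])

print(df)
```

Assigning the result back to the column, rather than calling `fillna` with `inplace=True` on a column selection, avoids modifying a temporary copy.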
EXAMPLE: DROPPING ROWS

    # example of removing rows that contain missing values
    from numpy import nan
    from pandas import read_csv
    # load the dataset
    dataset = read_csv('pima-indians-diabetes.csv', header=None)
    # summarize the shape of the raw data
    print(dataset.shape)
    # replace '0' values with 'nan'
    dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)
    # drop rows with missing values
    dataset.dropna(inplace=True)
    # summarize the shape of the data with missing rows removed
    print(dataset.shape)

OUTPUT:

    (768, 9)
    (392, 9)

EXAMPLE: MEAN IMPUTATION

    # manually impute missing values with numpy
    from pandas import read_csv
    from numpy import nan
    # load the dataset
    dataset = read_csv('pima-indians-diabetes.csv', header=None)
    # mark zero values as missing or NaN
    dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)
    # fill missing values with mean column values
    dataset.fillna(dataset.mean(), inplace=True)
    # count the number of NaN values in each column
    print(dataset.isnull().sum())

OUTPUT:

    0    0
    1    0
    2    0
    3    0
    4    0
    5    0
    6    0
    7    0
    8    0
    dtype: int64
ADVANTAGES
• It is quick and easy.
• When the mean is imputed, the mean of the column does not change.
• Using the mean makes sense because it is a reasonable estimate for a
randomly selected observation.
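The claim that mean imputation leaves the column mean unchanged can be checked with a short sketch. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
s = pd.Series([10.0, 20.0, np.nan, 30.0, np.nan])

# Series.mean() skips NaN by default, so this is the mean of 10, 20, 30.
mean_before = s.mean()

# Fill the gaps with that mean and recompute.
filled = s.fillna(mean_before)
mean_after = filled.mean()

print(mean_before, mean_after)
```

Each filled entry equals the observed mean, so adding it changes neither the average of the observed values nor the overall average.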
DISADVANTAGES
• It can give poor results for categorical features.
• The variance of the imputed variable is reduced, and its covariance
with the remaining variables is distorted.
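The effect on variance and covariance can be illustrated with synthetic data. Everything in this sketch is an assumption made for demonstration: the random seed, the sample size of 200, the 80 removed values, and the linear relation between `x` and `y`.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = 2 * x + pd.Series(rng.normal(scale=0.5, size=200))

# Remove 80 values of x, then fill them with the observed mean.
x_missing = x.copy()
x_missing.iloc[:80] = np.nan
x_filled = x_missing.fillna(x_missing.mean())

# Variance before and after imputation (NaN is skipped by default).
var_observed = x_missing.var()
var_filled = x_filled.var()

# Covariance with y; Series.cov() ignores pairs with a missing value.
cov_observed = x_missing.cov(y)
cov_filled = x_filled.cov(y)

print(var_observed, var_filled)
print(cov_observed, cov_filled)
```

The filled entries sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the number of observations. Both the variance and the covariance therefore shrink toward zero.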
THANK YOU!!