
Missing Value Handling Techniques

Dr. Mridu Sahu


Assistant Professor, IT
Understanding the ‘why’ of missing data
1) Data not being collected properly
2) Collection and management errors
3) Data intentionally being omitted
4) Could be created due to transformation of the data
5) Human Error
6) IoT device (like Sensor) malfunctions
7) Bugs in code
IS HANDLING MISSING VALUES IN DATASETS NECESSARY?
I SAY YES! AN ANALYSIS IS NOT COMPLETE UNTIL MISSING VALUES HAVE BEEN HANDLED, AND MANY MACHINE LEARNING ALGORITHMS DO NOT ACCEPT MISSING VALUES.
“Good design may not eliminate the problem of
missing data, but … it can reduce it, so that the
modern analytic machinery can be used to extract
statistical meaning from study data.
Conversely, we note that when insufficient attention
is paid to missing data at the design stage, it may
lead to inferential problems that are impossible to
resolve in the statistical analysis phase.” —Lavori,
et al. (2008)
Before handling missing values, one should understand why and where data is missing.

One important consideration in choosing a missing data approach is the missing data mechanism: different approaches have different assumptions about the mechanism.
Each of the three mechanisms describes one possible relationship between the propensity of data to be missing and values of the data, both missing and observed.
1. Missing Completely at Random (MCAR)
1) Missing completely at random (MCAR) assumes that missingness is unrelated to any data, observed or unobserved (response and covariate), meaning that the probability of a value being missing is independent of every observation in the data set.
2) In this case, the missing and observed observations are generated from the same distribution, meaning there is no systematic mechanism that makes some data more likely to be missing than others. When this assumption holds, you can perform a complete-case (CC) analysis on the observed data.
3) Under MCAR, complete-case estimates are reliable and unbiased, but statistical power is still lost because fewer observations are analyzed.
4) Under MCAR, you can analyze the observed observations and discard any rows with missing observations.

Missing Completely at Random, MCAR, means there is no relationship between the missingness of the data and any values, observed or missing.
Those missing data points are a random subset of the data. There is nothing systematic
going on that makes some data more likely to be missing than others.
2. Missing at Random (MAR)
• This kind of missingness is referred to as missing at random (MAR). MAR is more common than missing completely at random (MCAR) across disciplines.
• In this case, the missing and observed observations no longer come from the same distribution, and this is a crucial distinction between the two mechanisms.

Missing at Random, MAR, means there is a systematic relationship between the propensity of missing values and the observed data, but not the missing data.
Whether an observation is missing has nothing to do with the missing values, but it does
have to do with the values of an individual’s observed variables. So, for example, if men
are more likely to tell you their weight than women, weight is MAR.
3. Missing Not at Random (MNAR)
• MNAR analyses are problematic because the distribution of the missing observations depends not only on the observed values but also on the unobserved values.

Missing Not at Random, MNAR, means there is a relationship between the propensity of a value to be missing and its own value. This is the case when the people with the lowest education are missing on education, or when the sickest people are the most likely to drop out of the study.
MNAR is called “non-ignorable” because the missing data mechanism itself has to be
modeled as you deal with the missing data. You have to include some model for why the
data are missing and what the likely values are.
“Missing Completely at Random” and “Missing at Random” are both considered ‘ignorable’
because we don’t have to include any information about the missing data itself when we
deal with the missing data.
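The three mechanisms can be compared on simulated data. The following is a minimal sketch, not a definitive procedure: the data set (a `sex` and a `weight` column), the missingness probabilities, and the 75 kg threshold are all assumptions made up for illustration. It masks the same `weight` column under each mechanism and compares the mean of the remaining values with the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 100_000

# Hypothetical data: weight depends on sex (0 = female, 1 = male).
sex = rng.integers(0, 2, size=n)
weight = 60 + 15 * sex + rng.normal(0, 10, size=n)
df = pd.DataFrame({"sex": sex, "weight": weight})

# MCAR: every weight has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3

# MAR: missingness depends only on the observed variable `sex`
# (here women withhold their weight more often than men).
mar_mask = rng.random(n) < np.where(sex == 0, 0.5, 0.1)

# MNAR: missingness depends on the weight value itself
# (here heavier people withhold their weight more often).
mnar_mask = rng.random(n) < np.where(weight > 75, 0.6, 0.1)

true_mean = df["weight"].mean()
# Series.mask sets the flagged entries to NaN; mean() then skips NaN.
mcar_mean = df["weight"].mask(mcar_mask).mean()
mar_mean = df["weight"].mask(mar_mask).mean()
mnar_mean = df["weight"].mask(mnar_mask).mean()

# Under MAR, conditioning on the observed variable removes the bias:
# the observed female mean is close to the true female mean.
true_female_mean = df.loc[sex == 0, "weight"].mean()
mar_female_mean = df.loc[(sex == 0) & ~mar_mask, "weight"].mean()

print(f"true mean        : {true_mean:.2f}")
print(f"MCAR, observed   : {mcar_mean:.2f}")
print(f"MAR, observed    : {mar_mean:.2f}")
print(f"MNAR, observed   : {mnar_mean:.2f}")
print(f"MAR, women only  : {mar_female_mean:.2f} (true {true_female_mean:.2f})")
```

In this sketch the MCAR mean stays close to the true mean, the unadjusted MAR mean is shifted but can be corrected by using the observed `sex` variable, and the MNAR mean is shifted in a way the observed data alone cannot correct.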
The two types of deletion are:
1) Listwise deletion
2) Pairwise deletion
• It is recommended that these deletion techniques be used only when the data set contains few missing values.
Listwise Deletion
• When any column in a row is empty or NaN, listwise deletion deletes the entire row.
• As a result of listwise deletion, the data set shrinks.
Pairwise Deletion
• Pairwise deletion attempts to reduce the data loss that happens in listwise deletion.
• For every pair of variables, it calculates the statistic of interest (for example, the correlation coefficient) from all rows in which both variables are observed. A row with a missing value in one variable is therefore still taken into account in calculations that do not involve that variable.
Listwise Deletion using Python
• Listwise deletion is preferred when data are Missing Completely at Random. In listwise deletion, entire rows that hold missing values are deleted. It is also known as complete-case analysis, as it removes every row that has one or more missing values. In Python, the pandas dropna() function performs listwise deletion.
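As a minimal sketch of listwise deletion, the example below applies pandas `DataFrame.dropna()` to a small made-up data frame; the column names and values are hypothetical and serve only to illustrate the call.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing income.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [50_000, 62_000, np.nan, 58_000],
    "score": [0.7, 0.4, 0.9, 0.5],
})

# Listwise deletion: drop every row that has at least one missing value.
complete_cases = df.dropna()

print(complete_cases)
print("rows before:", len(df), "rows after:", len(complete_cases))
```

Note that `dropna()` returns a new data frame and leaves `df` unchanged, so the original data remain available for other approaches.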
Pairwise Deletion using Python
• Pairwise deletion is used when data are missing completely at random (MCAR).
• Pairwise deletion is preferred when the goal is to reduce the data loss that happens in listwise deletion. It is also called available-case analysis, as it leaves out only the missing observations, not the entire row.
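One way to see pairwise deletion in pandas is through `DataFrame.corr()`, which excludes missing values pair by pair. The sketch below uses a small made-up data frame (hypothetical column names and values) and contrasts the number of rows available to each pair of variables with the number of complete rows left after listwise deletion.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column has exactly one missing value.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, 35, 52],
    "income": [50, 62, np.nan, 58, 61, 75],   # in thousands
    "score": [0.7, 0.4, 0.9, 0.5, np.nan, 0.3],
})

# Pairwise deletion: each correlation uses the rows where BOTH columns exist.
corr_pairwise = df.corr()

# Listwise deletion for comparison: only fully complete rows are used.
corr_listwise = df.dropna().corr()

# Number of rows available to each pair of variables.
observed = df.notna().astype(int)
pair_counts = observed.T @ observed

print(pair_counts)
print("complete rows:", len(df.dropna()))
print(corr_pairwise.round(3))
```

Here every pair of variables keeps four rows, while listwise deletion keeps only three. Because each entry of the pairwise matrix can be based on a different subset of rows, the entries are not always mutually consistent, which is worth keeping in mind when the matrix feeds a later analysis.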
Missing Value Handling using Data Interpolation
Finding Missing Values (Python)
1) To find the missing values in a pandas data frame df, we can run isnull():
2) print(df.isnull())
3) This returns a Boolean data frame (True where a value is missing, False otherwise) with the same index and columns as the data frame in which you are trying to find the missing data.
4) To find the total number of missing values in each column:
5) print(df.isnull().sum())
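The calls above can be tried on a small example. This is a minimal sketch with a made-up data frame (hypothetical column names and values); it shows the Boolean mask, the per-column counts, and one way to get a single overall count.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [50_000, 62_000, np.nan, 58_000],
    "score": [0.7, 0.4, 0.9, 0.5],
})

# Boolean mask with the same shape as df: True marks a missing value.
missing_mask = df.isnull()

# Summing the mask counts the True values in each column.
missing_per_column = df.isnull().sum()

# Summing again gives the total number of missing values in the data frame.
total_missing = int(missing_per_column.sum())

print(missing_mask)
print(missing_per_column)
print("total missing:", total_missing)
```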
