
What causes missing data values?

The causes of missing values can be categorized into two primary types:

A. Value missing at random

B. Value missing, but not at random

• What is the type of the variable (data type) of the feature?

• How does the imputation method affect the distribution of the data?

Data type

Data types can be broadly classified as numerical or categorical. The data type determines which imputation method is appropriate. For example, it wouldn’t be wise to fill a categorical variable with the mean of its values, or to fill a numerical variable using a method meant for categories.
1. Complete removal of rows or columns with missing values

This is one of the most intuitive and simple methods. As the name implies, it involves removing all rows or columns that contain missing values.
• Note: Removing rows or columns could mean losing important information about the data along with the missing values.

• Columns with missing values can be removed completely when the NULL values significantly outnumber the other values present. In this situation, it wouldn’t make sense to keep these columns, as they hold little or no descriptive information about the data.
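The removal rules above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy rows, the use of None to mark a missing value, the helper names, and the 0.5 cutoff for "significantly more" missing values are all hypothetical choices, not something the text prescribes.

```python
# Hypothetical toy dataset: each row is a dict, and None marks a missing value.
rows = [
    {"age": 25, "city": "Lagos", "score": None},
    {"age": None, "city": "Accra", "score": None},
    {"age": 31, "city": None, "score": None},
    {"age": 40, "city": "Nairobi", "score": 7.5},
]


def drop_sparse_columns(rows, max_missing_fraction=0.5):
    """Drop columns whose share of missing values exceeds the threshold.

    The 0.5 default is an assumed cutoff chosen for this example.
    """
    columns = list(rows[0].keys())
    keep = [
        col
        for col in columns
        if sum(row[col] is None for row in rows) / len(rows) <= max_missing_fraction
    ]
    return [{col: row[col] for col in keep} for row in rows]


def drop_incomplete_rows(rows):
    """Keep only the rows that have no missing values at all."""
    return [row for row in rows if all(v is not None for v in row.values())]


# "score" is missing in 3 of 4 rows, so the column is dropped first.
without_sparse = drop_sparse_columns(rows)
# Rows that still contain a missing value are then removed.
complete = drop_incomplete_rows(without_sparse)
```

Dropping the sparse column first keeps more rows: removing incomplete rows straight away would leave only one of the four rows here, while this order leaves two.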

2. Mean/Median & Mode Imputation


Mean imputation works better when the variable is approximately normally distributed (Gaussian), while median imputation is preferable for skewed distributions, whether right- or left-skewed.
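A minimal sketch of mean and median imputation, using only the standard library, may make the difference concrete. The function name, the income values, and the use of None for missing entries are assumptions made for this example.

```python
from statistics import mean, median


def impute_numeric(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical right-skewed variable: one large value pulls the mean upward.
incomes = [20, 22, 25, None, 30, 200]

mean_filled = impute_numeric(incomes, "mean")
median_filled = impute_numeric(incomes, "median")
```

Here the mean of the observed values is 59.4, which is larger than four of the five observations, while the median is 25. On skewed data like this, the median gives a more typical fill value.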
3. For categorical variables

Mode imputation means replacing missing values with the mode, that is, the most frequent category value.
• Note: Mode imputation distorts the distribution of the data, because the most frequent category becomes over-represented.
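Mode imputation and the distortion it causes can be sketched as follows. The helper name and the color values are hypothetical, and None is assumed to mark a missing entry; ties for the most frequent category are broken by first occurrence in this sketch.

```python
from collections import Counter


def impute_mode(values):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]


# Hypothetical categorical variable with two missing entries.
colors = ["red", "blue", None, "red", None, "green"]
filled = impute_mode(colors)

# Share of the modal category before and after imputation.
observed = [c for c in colors if c is not None]
share_before = observed.count("red") / len(observed)
share_after = filled.count("red") / len(filled)
```

In this example "red" makes up half of the observed values but two thirds of the values after imputation, which is the distortion the note refers to.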

4. Random Sampling Imputation


This method involves replacing missing values with values drawn at random from the observed values of the same variable. It can be applied to both numerical and categorical variables, and it is used when the values are missing at random. Because the replacements come from the variable’s own observed values, it does not distort the distribution.
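Random sampling imputation might look like the sketch below. The function name, the seed parameter, and the height values are assumptions for illustration; None marks a missing entry, and the fixed seed is there only to make the example reproducible.

```python
import random


def impute_random_sample(values, seed=0):
    """Replace each None with a value drawn at random from the observed values."""
    rng = random.Random(seed)  # local generator so the global state is untouched
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


# Hypothetical numerical variable; the same function works for categories.
heights = [150, None, 165, 172, None, 180]
filled = impute_random_sample(heights)
```

Each missing entry gets its own independent draw, so the filled values follow the same spread as the observed ones rather than piling up on a single number.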

Other methods to use, especially if the values are not missing at random

Arbitrary values imputation

This involves using an arbitrary value to replace the missing values. One can think of these values as placeholders for the missing data. This method is used for numerical variables.
The most commonly used numbers are -1, 0, 99, and -999 (or other combinations of 9s). Which arbitrary number to use depends on the range of your data. For example, if your data lies between 1 and 100, it wouldn’t be wise to use 1 or 99, because those values may already exist in your data, and a placeholder is only useful as a flag for missing values if it cannot be confused with a real observation.
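The point about choosing a placeholder outside the data's range can be sketched like this. The function name and the score values are hypothetical, None is assumed to mark a missing entry, and the collision check is an added safeguard for this example rather than part of the method as described.

```python
def impute_arbitrary(values, placeholder=-999):
    """Replace None entries with a placeholder that flags them as missing."""
    observed = [v for v in values if v is not None]
    if placeholder in observed:
        # A placeholder that already occurs in the data cannot act as a flag.
        raise ValueError(f"placeholder {placeholder} already exists in the data")
    return [placeholder if v is None else v for v in values]


# Hypothetical scores between 1 and 100, so 1 and 99 are poor placeholders.
scores = [12, None, 87, 99, None, 1]
filled = impute_arbitrary(scores, placeholder=-999)
```

Calling the function with a placeholder of 99 on these scores raises a ValueError, since 99 is a real observation here.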

Missing Category Imputation

This method is used for categorical data. It involves labeling all missing values in a categorical column as ‘missing’.
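As a short sketch, missing category imputation amounts to a single substitution. The function name and the city values are hypothetical, and None is assumed to mark a missing entry.

```python
def impute_missing_category(values, label="missing"):
    """Label every None entry with an explicit 'missing' category."""
    return [label if v is None else v for v in values]


# Hypothetical categorical column with two missing entries.
cities = ["Lagos", None, "Accra", None]
filled = impute_missing_category(cities)
```

The missing entries now form their own category, so later analysis can still tell which values were originally absent.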
