There are two primary types of missing data values:
1. Values missing at random
2. Values missing not at random
The type of variable (categorical or numerical) and the effect of the imputation method on the distribution of the data should both be considered when choosing an imputation method. Common methods include mean or median imputation for numerical variables, mode imputation for categorical variables, random sampling imputation, and inserting arbitrary values such as -1 or 999 for missing numerical values.
The causes of missing values can be categorized into two primary
types:
A. Value missing at random
B. Value missing, but not at random
When choosing an imputation method, two questions matter:
What is the data type of the feature?
How does the imputation method affect the distribution of the data?
Data type
Data types can commonly be identified as numerical or categorical.
The data type affects which method we should use. For example, it wouldn’t be wise to replace missing values in a categorical variable with a mean, or to fill a numerical variable using a method meant for categories.
1. Complete removal of rows or columns of missing values
This is one of the most intuitive and simple methods. As the name implies, it involves removing all rows or columns that contain missing values.
Note: Use this method with care, because removing rows and columns could mean losing important information about the data along with the missing values.
Columns of missing values can be completely removed when NULL values significantly outnumber the other values present. In this situation, it wouldn’t make sense to keep these columns, as they hold little or no descriptive information about the data.
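The removal described above can be sketched with pandas. This is a minimal illustration, not a recommended recipe: the toy DataFrame, its column names, and the 0.5 cutoff for "mostly missing" are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, 52],
    "city": ["Lagos", "Accra", None, "Nairobi", "Cairo"],
    "notes": [np.nan, np.nan, np.nan, "vip", np.nan],  # mostly missing
})

# Fraction of missing values in each column.
missing_share = df.isna().mean()

# Drop columns where more than half of the values are missing
# (the 0.5 cutoff is an assumption, not a rule).
cols_to_drop = missing_share[missing_share > 0.5].index
reduced = df.drop(columns=cols_to_drop)

# Then drop every remaining row that still contains a missing value.
complete_rows = reduced.dropna(axis=0, how="any")
```

Comparing `len(df)` with `len(complete_rows)` afterwards shows how much data the removal cost, which is the information loss the note above warns about.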
2. Mean/Median & Mode Imputation
Mean imputation works better if the variable is normally distributed (has a Gaussian distribution), while median imputation is preferable for skewed distributions (whether right- or left-skewed).
3. For categorical variables
Mode imputation means replacing missing values with the mode, or the most frequent category value.
Note: It can distort the distribution of the dataset, since the most frequent category becomes over-represented.
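As a rough sketch of mean, median, and mode imputation side by side, the pandas snippet below fills one column with each statistic. The data and column names are made up for illustration; which statistic suits a real column depends on its distribution, as discussed above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one roughly symmetric column, one skewed column,
# and one categorical column.
df = pd.DataFrame({
    "height_cm": [160.0, 170.0, np.nan, 180.0, 170.0],
    "income": [20.0, 25.0, np.nan, 30.0, 500.0],   # skewed by one large value
    "colour": ["red", "blue", "red", None, "green"],
})

imputed = df.copy()

# Mean for the roughly symmetric numerical column.
imputed["height_cm"] = df["height_cm"].fillna(df["height_cm"].mean())

# Median for the skewed numerical column, so the outlier has less pull.
imputed["income"] = df["income"].fillna(df["income"].median())

# Mode (most frequent category) for the categorical column.
# mode() can return several values on a tie; this takes the first.
imputed["colour"] = df["colour"].fillna(df["colour"].mode().iloc[0])
```

Note that the mean of `income` would be pulled far upward by the single large value, which is why the median is used for that column here.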
4. Random Sampling Imputation
This method involves substituting the missing values with values drawn at random from the observed values of the same variable. It can be applied to both numerical and categorical variables, and it is used when the values are missing at random. It does not distort the distribution.
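One possible way to implement random sampling imputation in pandas is sketched below. The helper name `random_sample_impute` and its `seed` parameter are hypothetical, introduced only for this example. Sampling with replacement is a design choice made here so that the helper still works when missing values outnumber observed ones.

```python
import numpy as np
import pandas as pd

def random_sample_impute(series: pd.Series, seed: int = 0) -> pd.Series:
    """Fill missing entries with values drawn at random from the observed ones."""
    result = series.copy()
    missing_mask = result.isna()
    n_missing = int(missing_mask.sum())
    if n_missing == 0:
        return result

    observed = series.dropna()
    if observed.empty:
        raise ValueError("cannot sample: the series has no observed values")

    # Sample with replacement (an assumption of this sketch) and fix the
    # seed so that the result is reproducible.
    drawn = observed.sample(n=n_missing, replace=True, random_state=seed)

    # Assign by position; the drawn values carry the index of their source rows.
    result.loc[missing_mask] = drawn.to_numpy()
    return result

ages = pd.Series([22.0, np.nan, 35.0, 41.0, np.nan, 29.0])
filled = random_sample_impute(ages)
```

Because every filled value comes from the observed data, the imputed column can only contain values that already occurred, which is why the overall shape of the distribution is largely kept.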
Other methods to use, especially if the values are not missing at random
Arbitrary values imputation
This involves using an arbitrary value to replace the missing values. One can think of these values as placeholders for the missing values. This method is used for numerical variables. The most commonly used numbers are -1, 0, 99, and -999 (or other combinations of 9s). Which arbitrary number to use depends on the range of your data. For example, if your data lies between 1 and 100, it wouldn’t be wise to use 1 or 99, because those values may already exist in your data, and a placeholder is meant to flag missing values unambiguously.
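The placeholder idea can be sketched as follows. The data and the choice of -999 are assumptions for illustration. The explicit collision check is an extra safeguard added in this sketch to reflect the advice above about avoiding values that already exist in the data.

```python
import numpy as np
import pandas as pd

# Hypothetical scores whose valid range is 1 to 100.
scores = pd.Series([12.0, 87.0, np.nan, 45.0, np.nan, 100.0])

# Assumed placeholder, chosen well outside the valid range.
placeholder = -999

# Guard against a placeholder that already occurs as a real value.
if (scores == placeholder).any():
    raise ValueError("placeholder collides with an observed value")

filled = scores.fillna(placeholder)

# The placeholder now flags exactly the rows that were missing.
was_missing = filled == placeholder
```

Summary statistics computed on `filled` will be affected by the placeholder, so it is usually worth excluding the flagged rows (via `was_missing`) before computing means or ranges.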
Missing Category Imputation
This method is used for categorical data. It involves labeling all
missing values in a categorical column as ‘missing’.
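A minimal pandas sketch of missing category imputation is shown below. The column contents are invented for the example, and the label text 'missing' follows the description above.

```python
import pandas as pd

# Hypothetical categorical column stored as plain Python strings.
colours = pd.Series(["red", None, "blue", "red", None], dtype="object")

# Label every missing entry with its own explicit category.
labelled = colours.fillna("missing")

# 'missing' now shows up as a regular category in the counts.
counts = labelled.value_counts()
```

If the column uses pandas' categorical dtype rather than plain strings, the new label has to be registered first, for example with `colours.cat.add_categories("missing")`, before `fillna` will accept it.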