
Session 5: Handling Missing Data - The Imputation Method

Dr. Mahesh K C
Handling Missing Data
• “If only 5% of data values are missing from a data set of 30 variables, and the
missing values are evenly spread throughout the data, almost 80% of the records
would have at least one missing value” (Galit Shmueli, Nitin Patel and Peter
Bruce, Data Mining for Business Intelligence).

• Missing values pose problems for data analysis methods.
• They are more common in databases containing a large number of fields.
• The absence of information is rarely beneficial to the analysis task.
• Careful analysis is required to handle the issue.

• Delete records containing missing values?
Not necessarily the best approach.
Deleting records creates a biased subset.
Valuable information in the other fields is lost.
Methods for Handling Missing Data
• Replace Missing Values with a User-defined Constant
Missing numeric values are replaced with 0.0.
Missing categorical values are replaced with “Missing”.
• Replace Missing Values with the Mode or Mean
Mode for categorical variables.
Mean for numeric variables.
• Replace Missing Values with Random Values from the observed distribution of the
variable
Values are drawn at random from the underlying distribution.
This method is superior to mean substitution.
Measures of location and spread remain closer to the original.
• Replace Missing Values with Imputed Values
Estimate the missing value by regression on all the other attributes of the
particular record.
Estimate the missing value using multivariate imputation by chained equations.
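The three simple replacement strategies above can be sketched in a few lines. The sketch below uses Python/pandas rather than the R used later in these slides, and the toy data frame and its column names are illustrative only:

```python
import numpy as np
import pandas as pd

# Toy data with missing values; column names are illustrative.
df = pd.DataFrame({
    "mpg": [24.0, np.nan, 16.5, 30.0, np.nan],
    "cylinders": ["4", "8", None, "4", "4"],
})

# 1. User-defined constant: 0.0 for numeric, "Missing" for categorical.
const_filled = df.fillna({"mpg": 0.0, "cylinders": "Missing"})

# 2. Mean for numeric variables, mode for categorical ones.
mean_mode_filled = df.fillna({
    "mpg": df["mpg"].mean(),                 # (24 + 16.5 + 30) / 3 = 23.5
    "cylinders": df["cylinders"].mode()[0],  # most frequent level: "4"
})

# 3. Random draws from the observed (empirical) distribution of each variable;
#    location and spread stay closer to the original than with mean substitution.
rng = np.random.default_rng(123)
rand_filled = df.copy()
for col in df.columns:
    observed = df[col].dropna().to_numpy()
    mask = df[col].isna()
    rand_filled.loc[mask, col] = rng.choice(observed, size=mask.sum())
```

In R the same effect is obtained with functions such as `ifelse(is.na(x), ...)`; the point here is only the logic of each strategy.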
The Imputation Method
• The general question asked is “What would be the most likely value for this
missing entry, given all the other attributes of the particular record?”.
• For continuous variables, the imputation is done with regression.
• Estimate the missing value by taking the variable with missing entries as the
response and all the other variables as predictors.
• It uses stepwise regression (forward selection) to include significant
variables.
• For categorical variables, a classification algorithm such as CART is used.

• Here we use the method Multivariate Imputation by Chained Equations (MICE),
together with the VIM (Visualization and Imputation of Missing Values) package.
• Predictive mean matching is used for continuous variables and multinomial
logistic regression for categorical variables.

Imputation by Stepwise Regression
• First prepare the data for multiple regression; in particular, categorical variables
must be converted to dummy variables.
• Let Y be the response variable and (X1, X2, …, Xp) the p predictors.
• Suppose the predictor X2 has a missing value in some record, say R7.
• Treat X2 as the response variable, with all the original predictors (minus X2)
as the predictors.
• Do not include the original response variable Y as a predictor for imputation.
• Apply the stepwise variable-selection method to the model for X2.

• The model starts with no predictors; the most significant predictor is entered
first, followed by the next most significant, and so on, until no remaining
predictor is significant.

• Once the model is built, use it to predict the missing value of X2.
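A simplified sketch of this procedure (in Python; greedy forward selection driven by an R² gain threshold stands in for a formal significance test, which is an assumption of this sketch rather than the slide's exact criterion):

```python
import numpy as np

def forward_select_impute(X, target_col, min_r2_gain=0.01):
    """Impute missing entries of column `target_col` in matrix X by a
    forward-selection linear regression on the remaining columns.
    Predictors are added greedily while they improve R^2 by at least
    `min_r2_gain` (a stand-in for a formal significance test)."""
    y = X[:, target_col]
    obs = ~np.isnan(y)
    others = [j for j in range(X.shape[1]) if j != target_col]

    def fit(cols):
        # Least-squares fit of y on an intercept plus the chosen columns,
        # using only the records where y is observed.
        A = np.column_stack([np.ones(obs.sum())] + [X[obs, j] for j in cols])
        coef, *_ = np.linalg.lstsq(A, y[obs], rcond=None)
        resid = y[obs] - A @ coef
        ss_tot = ((y[obs] - y[obs].mean()) ** 2).sum()
        return 1.0 - (resid @ resid) / ss_tot, coef

    selected, best_r2 = [], 0.0
    while True:
        candidates = [j for j in others if j not in selected]
        if not candidates:
            break
        top_r2, top_j = max((fit(selected + [j])[0], j) for j in candidates)
        if top_r2 - best_r2 < min_r2_gain:
            break
        selected.append(top_j)
        best_r2 = top_r2

    # Predict the missing entries from the selected model.
    _, coef = fit(selected)
    miss = np.isnan(y)
    A_miss = np.column_stack([np.ones(miss.sum())] +
                             [X[miss, j] for j in selected])
    X = X.copy()
    X[miss, target_col] = A_miss @ coef
    return X, selected

# Demo: column 1 is exactly linear in column 0; record 7 is missing.
x0 = np.linspace(0.0, 1.0, 20)
X = np.column_stack([x0, 3 * x0 + 5, np.sin(np.arange(20))])
X[7, 1] = np.nan
X_imputed, selected = forward_select_impute(X, target_col=1)
```

In the slides' actual workflow this role is played by R's stepwise regression with significance tests; the threshold rule above is only a rough analogue of that selection step.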
Imputation by MICE-Procedure
• How do we impute when there are missing values in several different predictors? MICE.
• MICE assumes the data are Missing at Random (MAR) which means that the
probability of a missing value depends only on the observed values and can be
predicted using them.
• Procedure (Azur et al., 2011)
Step-1: Replace (impute) the missing values in each variable with temporary “place
holder” values derived from the observed values of the variables. This can be done by
mean imputation.
Step-2: The “place holder” mean imputations for one variable (say X1) are set back
to missing.
Step-3: Regress X1 on the other variables, using only the records where X1 is
observed (records where X1 is missing are dropped during model fitting).
Step-4: The missing values for X1 are then replaced with predictions from the
regression model.
Step-5: Repeat steps 2-4 for each variable that has missing values.
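The five steps above can be sketched for purely numeric data as follows. This is a Python/NumPy sketch using ordinary least squares for every variable; the real mice package adds a Bayesian or PMM draw, which is omitted here:

```python
import numpy as np

def mice_numeric(X, n_cycles=5):
    """Minimal MICE-style sketch for a numeric matrix X (np.nan = missing).
    Each cycle regresses every variable that has missing values on all the
    others, fitting on the records where it was observed, and refreshes
    its imputations (Steps 2-4 above, repeated per Step 5)."""
    X = X.copy()
    miss = np.isnan(X)
    # Step 1: placeholder mean imputation for every variable.
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]

    for _ in range(n_cycles):      # each pass over the variables is one "cycle"
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = [k for k in range(X.shape[1]) if k != j]
            rows = ~miss[:, j]     # Step 2: treat variable j as missing again
            # Step 3: regress variable j on the others, observed rows only.
            A = np.column_stack([np.ones(rows.sum()), X[rows][:, others]])
            coef, *_ = np.linalg.lstsq(A, X[rows, j], rcond=None)
            # Step 4: replace variable j's missing entries with predictions.
            A_mis = np.column_stack([np.ones(miss[:, j].sum()),
                                     X[miss[:, j]][:, others]])
            X[miss[:, j], j] = A_mis @ coef
    return X

# Demo: column 0 equals -0.5 + 0.5 * column 1 exactly; rows 3 and 8 are missing.
x = np.linspace(0.0, 1.0, 12)
X = np.column_stack([x, 2 * x + 1, x * x])
X[3, 0] = np.nan
X[8, 0] = np.nan
X_done = mice_numeric(X)
```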
MICE Cont’d.
• Applying Steps 2-5 once for each variable constitutes one “cycle”.
• MICE requires a number of such cycles and at each cycle the missing values
are updated.
• Generally, the number of cycles is taken as 5 which is specified in advance.
• Retain the imputed values corresponding to the final cycle.

• Note that the assumptions of the regression model must still be checked.

Predictive Mean Matching (PMM): How It Works
• Let Y be a variable with missing values in some records, and let X be a set of
variables with no missing data that are used to impute Y.
• Step-1: Using the records where Y is observed, fit a linear regression of Y on X,
producing a set of coefficients b.
• Step-2: Draw a new set of coefficients b* from the posterior predictive
distribution (ppd) of b (the Bayesian step).
• Step-3: Using b*, compute predicted values for the records where Y is missing;
using b, compute predicted values for the records where Y is observed.
• Step-4: For each record where Y is missing, find the closest predicted values
among the records where Y is observed.
• Step-5: From among these close cases, randomly choose one and impute the
missing value with the observed value of this close case.

• The distribution of predicted values given the observed values is called the
posterior predictive distribution.
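The five PMM steps can be sketched as follows. This Python sketch approximates the Bayesian draw of b* in Step 2 by jittering the least-squares coefficients, which is a simplification of the proper posterior draw used by the real mice implementation:

```python
import numpy as np

def pmm_impute(y, X, k=3, seed=123):
    """Predictive mean matching sketch. y contains np.nan where missing;
    X is a complete predictor matrix. The posterior draw of b* (Step 2)
    is approximated here by adding small noise to the OLS coefficients."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)
    A = np.column_stack([np.ones(len(y)), X])
    # Step 1: OLS on the observed records gives b.
    b, *_ = np.linalg.lstsq(A[obs], y[obs], rcond=None)
    # Step 2 (approximate Bayesian step): perturb b to get b*.
    b_star = b + rng.normal(scale=0.05 * (np.abs(b) + 1e-9), size=b.shape)
    # Step 3: b* predicts the missing records, b the observed ones.
    pred_obs = A[obs] @ b
    pred_mis = A[~obs] @ b_star
    y_imp = y.copy()
    donors = y[obs]
    for i, p_star in zip(np.flatnonzero(~obs), pred_mis):
        # Step 4: the k observed records with the closest predictions.
        closest = np.argsort(np.abs(pred_obs - p_star))[:k]
        # Step 5: impute a randomly chosen close donor's observed value.
        y_imp[i] = donors[rng.choice(closest)]
    return y_imp

# Demo: y is exactly linear in x; y[4] (true value 9) is missing.
x = np.arange(10.0)
y = 2 * x + 1
y[4] = np.nan
y_done = pmm_impute(y, x.reshape(-1, 1))
```

Note that the imputed value is always one of the actually observed values of Y, never a raw regression prediction; this keeps imputations within the realistic range of the data.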
Illustrative example for PMM
• Let Y={y1, y2, … , y10}. Suppose Y has a missing value say y7.
• Let X={X1, X2, X3} be the set of predictors, each with 10 observations and no
missing records.
• Regress Y on X using the nine observed values, which produces b={b0, b1, b2, b3}.
Estimated model: y = b0 + b1x1 + b2x2 + b3x3.
• Draw b*={b0*, b1*, b2*, b3*} from the ppd of b.
• For all observed values in Y, get predicted values using b, say
P={p1, p2, …, p6, p8, p9, p10}.
• For the missing record y7, get a predicted value using b*, say p*.
• Choose from P the predicted values (usually three) which are closest to p*, say p1,
p6, and p9.
• The PMM algorithm then randomly chooses one of these close cases, say y9
(corresponding to p9), and uses its observed value as the imputed value for y7.
Example Cont’d.
• Let Y=(2, 4, 36, 16, 23, 45, y7, 13, 20, 7)
• Let X={X1, X2, X3} be the set of predictors with no missing values.
• Estimated model: y = b0+ b1x1 +b2x2 +b3x3.
• Suppose the prediction for y7 obtained using b* is p* = 10.
• Let P={3.5, 5, 31, 13, 27, 44, 11, 18.5, 5.5} based on the above model.
• The predicted values in P closest to p* = 10 are {13, 11, 5.5},
which correspond to the observed values {16, 13, 7} in Y.
• Hence one of these, say 7, is chosen as the imputed value for y7.
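The donor selection in this example can be verified with a few lines of Python (used here purely to check the arithmetic):

```python
import numpy as np

# Observed values of Y (y7 is missing) and their predictions P, in the
# same order; p_star is the b*-based prediction for the missing y7.
y_obs = np.array([2.0, 4, 36, 16, 23, 45, 13, 20, 7])
P = np.array([3.5, 5, 31, 13, 27, 44, 11, 18.5, 5.5])
p_star = 10.0

# Indices of the three predicted values closest to p_star ...
closest = np.argsort(np.abs(P - p_star))[:3]
# ... and the observed values (candidate donors) they correspond to.
donors = y_obs[closest]
```

This recovers the closest predictions {11, 13, 5.5} and the donors {13, 16, 7}; PMM then picks one donor at random, in this example 7.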

Imputation of Cars Data: Explanation of R-outputs
• The original data has 261 records on 8 variables.
• For the imputation exercise this data set was modified so that it includes some
missing entries in several variables.
• We made records 20 and 55 of the variable “mpg”, record 42 of “cylinders” and
record 11 of “hp” missing.
• After deleting these entries, the data set was saved as “Impucars.csv”.
• The original entries corresponding to these records in “Cars.csv” are:
R20: 24, R55: 16.5, R42: 8, R11: 70.

• summary(Car): Shows that the variables “mpg”, “cylinders” and “hp” have
missing values (NAs).
• which(is.na(Car$mpg)): Entries 20 and 55 are missing.
• which(is.na(Car$cylinders)): Entry 42 is missing.
• which(is.na(Car$hp)): Entry 11 is missing.
Explanations Cont’d. Visualizing Missing data
• mis_plot<-aggr(Car, col=c("blue","red"),xlab=names(Car), ylab=c("percent of Missing
Data","Patterns"), cex.axis=0.7, sortVars=TRUE)
• summary(mis_plot)

Missing in combinations of variables:

Combinations      Count    Percent
0:0:0:0:0:0:0:0     257  98.4674330
0:0:0:1:0:0:0:0       1   0.3831418
0:1:0:0:0:0:0:0       1   0.3831418
1:0:0:0:0:0:0:0       2   0.7662835

• Out of 261 records, 257 have no missing values (98.47%).
• One record has a missing value in one variable (0.38%).
• One further record has a missing value in another variable (0.38%).
• Two records each have one missing value in the same variable (0.77%).

Visualizing Missing data Cont’d.
• md.pattern(Car)

      cubic inches  weight lbs  time to 60  year  brand  cylinders  hp  mpg
257              1           1           1     1      1          1   1    1  0
  2              1           1           1     1      1          1   1    0  1
  1              1           1           1     1      1          1   0    1  1
  1              1           1           1     1      1          0   1    1  1
                 0           0           0     0      0          1   1    2  4

Explanation Cont’d.
• Imput<-mice(Car, m=3, seed = 123): m is the number of imputations produced for
each missing entry; seed is set for reproducibility.
• Imput$imp$mpg: Gives 3 imputed values for R20 and R55 for “mpg”
1 2 3
20 24 20 25
55 15 15 15
• Imput$imp$cylinders: Gives 3 imputed values for R42 of “cylinders”.
1 2 3
42 8 8 8
• Imput$imp$hp: Gives 3 imputed values for R11 of “hp”.
1 2 3
11 67 69 78
• compData<-complete(Imput,1): Replaces the missing values with the imputed values
from the first imputation and returns the completed data set, which can be used for
further analysis.

References

• Larose, D.T. & Larose, C.D. (2016), Data Mining and Predictive
Analytics, 2nd edition, Wiley.
• Azur, M.J., Stuart, E.A., Frangakis, C., & Leaf, P.J. (2011), Multiple
Imputation by Chained Equations: What is it and how does it work?,
International Journal of Methods in Psychiatric Research, 20(1),
40-49.
