
Chapter 2

Review of Related Literature
Much research effort has been devoted to the efficacy of various imputation methods. In the report entitled Compensating for Missing Survey Data, two simulation studies using data from the 1978 Income Survey Development Program Research Panel were carried out to compare several imputation methods. The first study compared imputation methods for the variable Hourly Rate of Pay, while the second dealt with the imputation of the variable Quarterly Earnings. For both studies, the author stratified the data into imputation classes, constructed data sets with missing values by randomly deleting some of the recorded values in the original data set, and then applied the various imputation methods to fill in the missing values. This process was replicated ten times to ensure consistency of the results. Once the imputation methods had been applied, three measures for evaluating their effectiveness, namely the Mean Deviation, the Mean Absolute Deviation and the Root Mean Square Deviation, were obtained and averaged across the ten trials. (Kalton, 1983)
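The three evaluation measures can be illustrated with a minimal sketch. It assumes that each deviation is taken as the imputed value minus the actual (deleted) value, so that a negative Mean Deviation indicates underestimation; the function name and the sample numbers are hypothetical and are not taken from Kalton's report.

```python
import math

def deviation_measures(actual, imputed):
    """Return (mean deviation, mean absolute deviation, root mean square deviation).

    Assumption: deviation = imputed - actual, so a negative mean deviation
    means the imputed values underestimate the deleted values.
    """
    diffs = [i - a for a, i in zip(actual, imputed)]
    n = len(diffs)
    md = sum(diffs) / n
    mad = sum(abs(d) for d in diffs) / n
    rmsd = math.sqrt(sum(d * d for d in diffs) / n)
    return md, mad, rmsd

# Hypothetical deleted values and the values imputed in their place.
actual = [4.0, 6.0, 10.0]
imputed = [5.0, 5.0, 7.0]
md, mad, rmsd = deviation_measures(actual, imputed)
print(md, mad, rmsd)
```

In a simulation such as the one described, these three numbers would be computed once per replication and then averaged across the replications.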

For the first study, on imputing the variable Hourly Rate of Pay, eight methods were used, namely the Grand Mean Imputation (GM), the Class Mean Imputation using eight imputation classes (CM8), the Class Mean Imputation using ten imputation classes (CM10), Random Imputation with eight imputation classes (RC8), Random Imputation with ten imputation classes (RC10), Multiple Regression Imputation (MI), Multiple Regression Imputation plus a random residual chosen from a normal distribution (MN) and Multiple Regression Imputation plus a randomly chosen respondent residual (MR). Using the Mean Deviation criterion, the results showed that all mean deviations were negative, indicating that the imputed values underestimated the actual values. Moreover, the Grand Mean Imputation (GM) had the greatest underestimation among the eight procedures. For the Mean Absolute Deviation and the Root Mean Square Deviation, which measure the ability to reconstruct the deleted value, the Grand Mean Imputation fared the worst on both criteria. In addition, the Multiple Regression Imputation (MI) obtained the best measures for the two criteria, and the procedures with a greater number of imputation classes (i.e. CM10 versus CM8, RC10 versus RC8) yielded slightly better results for the two criteria. (Kalton, 1983)
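The difference between the grand mean, class mean and random within-class procedures can be sketched as follows. This is an illustration only: the records, the class labels and the function names are hypothetical, and the sketch assumes that every imputation class contains at least one respondent.

```python
import random
from statistics import mean

def respondents_by_class(records):
    """Group the observed (non-missing) values by imputation class."""
    groups = {}
    for cls, value in records:
        if value is not None:
            groups.setdefault(cls, []).append(value)
    return groups

def grand_mean_impute(records):
    """Replace every missing value with the mean of all respondents."""
    fill = mean(v for _, v in records if v is not None)
    return [fill if v is None else v for _, v in records]

def class_mean_impute(records):
    """Replace a missing value with the respondent mean of its own class."""
    groups = respondents_by_class(records)
    return [mean(groups[c]) if v is None else v for c, v in records]

def random_within_class_impute(records, rng):
    """Replace a missing value with a randomly drawn respondent value from its class."""
    groups = respondents_by_class(records)
    return [rng.choice(groups[c]) if v is None else v for c, v in records]

# Hypothetical records: (imputation class, value); None marks a missing value.
records = [("A", 10.0), ("A", 12.0), ("A", None),
           ("B", 20.0), ("B", 24.0), ("B", None)]

gm = grand_mean_impute(records)
cm = class_mean_impute(records)
rc = random_within_class_impute(records, random.Random(0))
print(gm, cm, rc, sep="\n")
```

The grand mean assigns the same value to both nonrespondents regardless of class, whereas the two class-based procedures keep the imputed values within the range of their own class.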

For the second study, on the imputation of Quarterly Earnings, ten imputation procedures were used. These were the Grand Mean Imputation (GM), the Class Mean Imputation using eight imputation classes (CM8), the Class Mean Imputation using twelve imputation classes (CM12), Random Imputation with eight imputation classes (RC8), Random Imputation with twelve imputation classes (RC12), Multiple Regression Imputation (MI), Multiple Regression Imputation plus a random residual chosen from a normal distribution (MN), Multiple Regression Imputation plus a randomly chosen respondent residual (MR), Mixed Deductive and Random Imputation using eight imputation classes (DI8) and Mixed Deductive and Random Imputation using twelve imputation classes (DI12). Using the first criterion, the Mean Deviation, the results showed that the Grand Mean (GM) obtained a positive bias. This implied that the Grand Mean Imputation was not an effective imputation method for this study. The results also showed that the regression imputation procedures gave very similar results, producing almost unbiased estimates. In addition, the Class Mean Imputation methods (CM8 and CM12) had measures similar to those of the Random Imputation methods. Nevertheless, all methods produced relatively small mean deviations except for the last two methods (DI8 and DI12). Comparing the Mean Absolute Deviations and the Root Mean Square Deviations, the results show that the Grand Mean Imputation obtained values similar to those of the regression procedures with residuals (i.e. MN and MR). The results also show that the values for the RC8, RC12, MN and MR procedures are over one third larger than those for deterministic procedures such as CM8, CM12 and MI. (Kalton, 1983)
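The contrast between the deterministic and the stochastic regression procedures can be sketched as follows. For brevity the sketch assumes a single predictor rather than a multiple regression; the data and all function and parameter names are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def regression_impute(xs, ys, residual="none", rng=None):
    """Impute missing ys (None) from xs.

    residual="none"        deterministic prediction (in the spirit of MI)
    residual="normal"      prediction plus a normal random residual (MN)
    residual="respondent"  prediction plus a randomly chosen respondent residual (MR)
    """
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed], [y for _, y in observed])
    resid = [y - (intercept + slope * x) for x, y in observed]
    # Residual standard deviation with two estimated coefficients.
    sd = (sum(r * r for r in resid) / (len(resid) - 2)) ** 0.5
    result = []
    for x, y in zip(xs, ys):
        if y is not None:
            result.append(y)
            continue
        prediction = intercept + slope * x
        if residual == "normal":
            prediction += rng.gauss(0.0, sd)
        elif residual == "respondent":
            prediction += rng.choice(resid)
        result.append(prediction)
    return result

# Hypothetical data: the last y value is missing.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, None]

deterministic = regression_impute(xs, ys)
with_respondent_residual = regression_impute(xs, ys, "respondent", random.Random(1))
print(deterministic[-1], with_respondent_residual[-1])
```

Because the stochastic variants add noise to each prediction on purpose, their imputed values lie farther from the deleted values on average, which is consistent with the larger Mean Absolute Deviations and Root Mean Square Deviations reported for the RC, MN and MR procedures.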

To investigate the relatively larger biases of the DI8 and DI12 procedures, the author further divided the data into deductive and nondeductive cases. This shed further light on the Mean Deviations and Mean Absolute Deviations of the various imputation methods. It was found that the mean deviations were positive in the deductive cases and negative in the nondeductive cases for all of the procedures. This explains why the deviations in the previous results were relatively small: the measures for the two kinds of cases tend to cancel out. It also showed that the DI8 and DI12 results were similar to those of RC8, RC12, CM8 and CM12 in the nondeductive cases but differed considerably in the deductive cases. This explains the larger values of DI8 and DI12 in the previous results. (Kalton, 1983)

Taken together, the two studies showed that the imputation procedures tended to underestimate the Hourly Rate of Pay and overestimate the Quarterly Earnings. Moreover, mean imputation appeared to be the weakest imputation method in both studies, since it distorted the distribution of the original data. Lastly, Kalton’s study shows that increasing the number of imputation classes yields slightly better values for the criteria used.

In contrast to Kalton’s criteria for measuring the performance of imputation procedures, the paper entitled A Comparison of Imputation Techniques for Missing Data by C. Musil, C. Warner, P. Yobas and S. Jones presented a much simpler approach to evaluating the performance of imputation techniques. The authors computed the means, standard deviations and correlation coefficients of the original data and compared them with the same statistics obtained from five methods, namely Listwise Deletion, Mean Imputation, Deterministic Regression, Stochastic Regression and the EM Method. The Expectation Maximization (EM) Method is an iterative procedure that generates values for the missing data by alternating between an expectation step (E-step) and a maximization step (M-step). The E-step calculates expected values based on all complete data points, while the M-step replaces the missing values with the values generated in the E-step and then recomputes new expected values. (Musil, Warner, Yobas & Jones, 2002)
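The alternation between the two steps can be illustrated with a textbook-style sketch for a simple case. It assumes two jointly normal variables, with x fully observed and y partly missing, and it is not the SPSS Missing Value Analysis implementation used by the authors; the data and all names are hypothetical.

```python
def em_impute(xs, ys, iterations=200):
    """EM sketch for a bivariate normal model with x complete and y partly missing.

    E-step: compute the expected y (and y squared) for each missing case given x
            and the current parameter estimates.
    M-step: re-estimate the mean of y, the variance of y and the covariance
            from the completed sufficient statistics.
    Returns the completed y values and the final estimates.
    """
    n = len(xs)
    missing = [y is None for y in ys]
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    observed = [y for y in ys if y is not None]
    # Starting values: observed-case mean and variance, zero covariance.
    my = sum(observed) / len(observed)
    syy = sum((y - my) ** 2 for y in observed) / len(observed)
    sxy = 0.0
    for _ in range(iterations):
        beta = sxy / sxx
        residual_var = syy - beta * sxy
        # E-step: conditional expectations for the missing cases.
        ey = [my + beta * (x - mx) if m else y for x, y, m in zip(xs, ys, missing)]
        ey2 = [e * e + (residual_var if m else 0.0) for e, m in zip(ey, missing)]
        # M-step: update the estimates from the expected sufficient statistics.
        my = sum(ey) / n
        syy = sum(ey2) / n - my * my
        sxy = sum(x * e for x, e in zip(xs, ey)) / n - mx * my
    beta = sxy / sxx
    completed = [my + beta * (x - mx) if m else y for x, y, m in zip(xs, ys, missing)]
    return completed, {"mean_y": my, "var_y": syy, "cov_xy": sxy}

# Hypothetical data: the last y value is missing.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, None]
completed, estimates = em_impute(xs, ys)
print(completed[-1], estimates)
```

In this simple pattern the imputed value settles on the prediction from the regression of y on x among the complete cases, while the variance estimate also accounts for the uncertainty of the missing value.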

Using the Center for Epidemiological Studies data on stress and health ratings of older adults, the authors imputed a single variable, namely the functional health rating. Of the 492 cases, 20% were deleted in an effort to maximize the effects of each imputation method. Except for Listwise Deletion and Mean Imputation, the researchers used the SPSS Missing Value Analysis function for the Deterministic Regression, Stochastic Regression and EM Method. For the correlations, the researchers obtained the correlations of the imputed variable with the variables age, gender and self-assessed health rating, both in the original data and under each of the five methods. (Musil, Warner, Yobas & Jones, 2002)

Comparing the mean of the original data with the means from the five methods, the results show that all methods underestimated the mean. The closest to the original data was the Stochastic Regression, followed very closely by the EM Method, then Deterministic Regression, Listwise Deletion and Mean Imputation. The same ordering also holds for the standard deviations. For the correlations, however, the EM Method produced the values closest to those of the original data, followed closely by the Stochastic Regression, then Deterministic Regression, Listwise Deletion and Mean Imputation. Hence, the findings suggest that the Stochastic Regression and the EM Method performed better, while Mean Imputation was the least effective. (Musil, Warner, Yobas & Jones, 2002)

In another study, entitled Imputation Methods, Simulation, Experiments and Practical Examples, Nordholt described two simulation experiments on the Hot Deck Method. The first study examined whether the Hot Deck Method performs better than the Available Case Method, in which records with nonresponse are left out of the data set when the variable is analyzed. This was done by constructing a fictitious data set of four variables, two of which were used for the imputation. Three nonresponse rates were then set, namely 5%, 10% and 20%, and the simulation process was replicated 50 times. The data set containing the missing values was first analyzed using the Available Case Method and then using the Hot Deck Imputation. As in the methodology of Musil et al., descriptive statistics such as the mean, variance and correlation were computed. Moreover, the absolute differences between the original data and the Available Case Method, and between the original data and the Hot Deck Method, were computed. Based on these criteria, the results show that the Hot Deck performs better than the Available Case Method. The Hot Deck, while giving results closer to the original data, also had a tendency to underestimate the values. In terms of the absolute differences, it was observed that these values increase as the percentage of missing values increases. (Nordholt, 1998)
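The design of such a simulation can be sketched as follows. The sketch assumes that values are deleted completely at random and that donors are drawn at random within an imputation class; the fictitious data, the class labels and the function names are hypothetical and do not reproduce Nordholt's data set. No particular outcome is implied, since the result depends on the data and on the deletion mechanism.

```python
import random
from statistics import mean

def delete_at_random(values, rate, rng):
    """Set a fraction `rate` of the values to None (missing)."""
    n_missing = round(rate * len(values))
    chosen = set(rng.sample(range(len(values)), n_missing))
    return [None if i in chosen else v for i, v in enumerate(values)]

def hot_deck(values, classes, rng):
    """Fill each missing value with a random donor value from the same class."""
    donors = {}
    for v, c in zip(values, classes):
        if v is not None:
            donors.setdefault(c, []).append(v)
    return [rng.choice(donors[c]) if v is None else v
            for v, c in zip(values, classes)]

def simulate(values, classes, rate, replications, rng):
    """Average absolute error of the estimated mean under both methods."""
    true_mean = mean(values)
    available_case_errors, hot_deck_errors = [], []
    for _ in range(replications):
        holed = delete_at_random(values, rate, rng)
        available = [v for v in holed if v is not None]
        available_case_errors.append(abs(mean(available) - true_mean))
        hot_deck_errors.append(abs(mean(hot_deck(holed, classes, rng)) - true_mean))
    return mean(available_case_errors), mean(hot_deck_errors)

# Fictitious data: two imputation classes of 20 records each.
classes = ["low"] * 20 + ["high"] * 20
values = [10.0 + i for i in range(20)] + [50.0 + i for i in range(20)]

rng = random.Random(42)
for rate in (0.05, 0.10, 0.20):
    ac_error, hd_error = simulate(values, classes, rate, 50, rng)
    print(rate, ac_error, hd_error)
```

The three rates and the 50 replications mirror the design described above; with at most 20% of 40 records deleted, each class of 20 always retains at least one donor.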

Nordholt’s second simulation study focused on the effects of covariates, otherwise known as imputation classes, on the quality of the Hot Deck Imputation. Using the data of the Dutch Housing Demand Survey of Statistics Netherlands, the variable value of the house was chosen as the variable to be imputed because of its importance and the frequency of nonresponse in that variable. For this study, the observations under category 13 (value worth at least 150,000) and category 22 (value worth at 300,000) were changed into missing values. The rationale for this choice was to ensure that the original values from these categories would not be used as replacements for the variable to be imputed, since they were no longer in the file. Imputation classes were then created once the missing values had been identified. A table showing the number of respondents before and after imputation showed that in every category except 13 and 22, which were set to missing, the number of respondents increased after the imputation. This showed that the remaining records had an equal probability of becoming a donor record for an imputation and that not all imputations gave values near category 13 or 22. Nordholt also compared the Available Case Method and the Hot Deck Method on this real-life data. As in the first study, the Hot Deck fared better than the Available Case Method. (Nordholt, 1998)

Lastly, Nordholt addressed several questions regarding imputation. Using examples of how imputation is applied in real-life surveys such as the Dutch Housing Demand Survey, the European Community Household Panel Survey (ECHP) and the Dutch Structure of Earning Survey, he outlined four criteria for deciding which variables to impute. These are the importance of a variable, the percentage of nonresponse, the predictability of the missing values and the cost of imputation. He also noted the importance of estimating the duration of the imputation process, since a study needs to be timely. The duration, according to Nordholt, depends on the number of variables to be imputed, the available capacity, the user friendliness of the imputation package and the desired imputation quality. These issues must be settled before conducting any imputation process and choosing the appropriate imputation strategy. (Nordholt, 1998)

Two undergraduate theses conducted similar studies on imputation. The first was by Salvino and Yu. They assessed the efficiency of Mean Imputation versus the Hot Deck Imputation Technique by applying these techniques to the 1991 Census on Agriculture and Fisheries (CAF) data. In their research, they generated incomplete data using the Gauss Software for the imputed variables, which were the counts of cattle, hogs and chickens. To determine which of the two techniques was better, the variances were compared. Based on the variances, it was determined that the Hot Deck Imputation Technique was better. The design effect was also considered by dividing the variance under the Hot Deck Imputation by the variance under the Mean Imputation. Since the resulting ratio was less than one, they again concluded that the Hot Deck Imputation Technique was the better option. (Salvino and Yu, 1996)
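The variance ratio described above amounts to a single division, sketched below. The source does not state exactly which variances were compared, so the sketch assumes that they are the variances of an estimate across repeated imputations; the numbers and names are hypothetical and are not results from the thesis.

```python
from statistics import variance

def variance_ratio(hot_deck_estimates, mean_imputation_estimates):
    """Variance under hot deck divided by variance under mean imputation.

    A ratio below one favors the hot deck under the comparison described.
    """
    return variance(hot_deck_estimates) / variance(mean_imputation_estimates)

# Hypothetical estimates of the same quantity from three repeated imputations.
hot_deck_estimates = [100.0, 102.0, 98.0]
mean_imputation_estimates = [100.0, 106.0, 94.0]

ratio = variance_ratio(hot_deck_estimates, mean_imputation_estimates)
print(ratio)
```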

Another undergraduate thesis, by Cheng and Sy, focused on assessing imputation techniques on clinical data. The authors employed four methods of imputation, namely Mean Imputation, Hot Deck Imputation, Linear Regression and Multiple Linear Regression. They assessed the efficacy of the imputation techniques by looking at the accuracy and precision of the estimates. Accuracy was measured by the percentage error, and the variance of these percentage errors was the basis for the precision of the estimates. The results show that Linear Regression was the best method, followed closely by Multiple Linear Regression, then Hot Deck and finally Mean Imputation. (Cheng and Sy, 1999)
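The two criteria can be illustrated with a short sketch. It assumes that the percentage error is the signed difference between the imputed and the actual value relative to the actual value, that accuracy is summarized by the mean absolute percentage error, and that all actual values are nonzero; the numbers and names are hypothetical and are not results from the thesis.

```python
from statistics import mean, variance

def percentage_errors(actual, imputed):
    """Signed percentage error of each imputed value (actual values must be nonzero)."""
    return [100.0 * (i - a) / a for a, i in zip(actual, imputed)]

# Hypothetical actual values and their imputed counterparts.
actual = [50.0, 80.0, 100.0]
imputed = [55.0, 76.0, 100.0]

errors = percentage_errors(actual, imputed)
accuracy = mean(abs(e) for e in errors)   # smaller means more accurate
precision = variance(errors)              # smaller means more precise
print(errors, accuracy, precision)
```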