Draft Only: Subjected to Revision

Formation of Imputation Classes

Most imputation methods require imputation classes to be formed before any imputation takes
place. In this study, three imputation methods were applied within imputation classes: Hot Deck
Imputation, Deterministic Regression and Stochastic Regression. Imputation classes play a major
role in the bias and reliability of the estimates produced by the different imputation methods.
The main goals of imputation classes are (1) to minimize the variance within each class while
maximizing the variance between classes and (2) to reduce the bias of the estimates. As defined
earlier in the methodology, imputation classes are stratification classes that divide the data
into groups. This division produces homogeneous groups only if proper techniques are applied in
creating the imputation classes and selecting their variables.
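
The first goal can be checked numerically by splitting the total sum of squares of a variable into a within-class part and a between-class part. The following is a minimal sketch, not the procedure used in the study; the function name, the income values and the class labels are all hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def variance_decomposition(values, classes):
    """Split the total sum of squares into within-class and between-class parts."""
    groups = defaultdict(list)
    for value, label in zip(values, classes):
        groups[label].append(value)

    grand_mean = sum(values) / len(values)
    within = 0.0
    between = 0.0
    for members in groups.values():
        class_mean = sum(members) / len(members)
        # Spread of the observations around their own class mean.
        within += sum((v - class_mean) ** 2 for v in members)
        # Spread of the class mean around the overall mean, weighted by class size.
        between += len(members) * (class_mean - grand_mean) ** 2
    return within, between


# Hypothetical incomes and two hypothetical ways of forming classes.
income = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
good_classes = ["A", "A", "A", "B", "B", "B"]  # similar incomes grouped together
poor_classes = ["A", "B", "A", "B", "A", "B"]  # classes unrelated to income

within_good, between_good = variance_decomposition(income, good_classes)
within_poor, between_poor = variance_decomposition(income, poor_classes)
```

In this toy example the first grouping leaves almost all of the variation between the classes, which is the situation the goals above describe, while the second grouping leaves most of it within the classes.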

Before imputation classes are formed, several considerations must be addressed in order to
arrive at a good set of imputation classes. It is very important to use variables that are highly
correlated with the variables to be simulated and that minimize the variance within groups. The
variables used to form imputation classes are called matching or control variables. Within the
classes formed by the matching variables, observations that supply values are called donors and
observations whose values are to be simulated are called recipients. Imputation classes can be an
effective tool for distinguishing the best among the imputation methods, whether or not classes
are introduced. Methods with good imputation classes have a greater advantage than methods that
do not use imputation classes. However, this still depends on how the imputation classes are
selected.
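
To make the donor and recipient roles concrete, the sketch below fills each missing value with a value drawn at random from a donor in the same imputation class. This is only one common variant of hot deck imputation and is assumed here for illustration; the source does not state which variant the study used. The function name, the field names `income` and `educ_class`, and the records are hypothetical.

```python
import random


def hot_deck_impute(records, target, class_var, seed=0):
    """Fill missing `target` values with a value from a random donor in the same class."""
    rng = random.Random(seed)

    # Donors: observed values of the target variable, grouped by imputation class.
    donors = {}
    for rec in records:
        if rec[target] is not None:
            donors.setdefault(rec[class_var], []).append(rec[target])

    completed = []
    for rec in records:
        new_rec = dict(rec)  # copy so the input records are left unchanged
        if new_rec[target] is None:  # recipient
            pool = donors.get(new_rec[class_var])
            if pool:  # the value stays missing when the class has no donor
                new_rec[target] = rng.choice(pool)
        completed.append(new_rec)
    return completed


# Hypothetical records; None marks a nonresponse.
records = [
    {"id": 1, "educ_class": "low", "income": 100.0},
    {"id": 2, "educ_class": "low", "income": None},
    {"id": 3, "educ_class": "high", "income": 300.0},
    {"id": 4, "educ_class": "high", "income": 320.0},
    {"id": 5, "educ_class": "high", "income": None},
    {"id": 6, "educ_class": "none", "income": None},
]
completed = hot_deck_impute(records, "income", "educ_class")
```

The last record shows why the choice of classes matters: a class without any donor cannot supply a value, so its recipients remain missing.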

In this study, the first visit data set was used to form the imputation classes. The second
visit data could not be used for this purpose because its observations are assumed to be missing.
All potential matching variables must be evaluated in order to obtain the best imputation classes
for the three methods. Categorical variables that are economically related to the variables to be
simulated were chosen as candidates. Three categorical variables were chosen as potential
matching variables: Province (PROV1), Education Status (ES1) and Total Employed Household
Members (TOTEM1).

It is important to set a definite number of categories for each matching variable to avoid
certain dangers. “If the number of imputation groups decreases, heterogeneity within the groups
increases and the estimates become increasingly burdened with aggregation bias. On the other
hand, as the number of imputation groups increases, this negatively affects the precision of the
estimates, thus inflating their variances.” All potential variables were categorical, since they
require no further categorization, whereas categorizing continuous variables can cause a loss of
information. In addition, continuous variables are in practice more prone to nonresponse than
categorical variables. Using continuous variables that contain nonresponse to form imputation
classes would only increase the bias of the estimates. However, since ES1 and TOTEM1 had too
many categories, they were recategorized to reduce the number of categories.
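
Recategorization of this kind can be sketched as a simple recoding step followed by a count of observations per category. The category labels and the collapsing scheme below are hypothetical; the source does not give the actual recoding used for ES1 or TOTEM1.

```python
from collections import Counter


def recode(values, mapping):
    """Map detailed categories to broader ones; unmapped values are kept unchanged."""
    return [mapping.get(v, v) for v in values]


# Hypothetical detailed education codes and a hypothetical collapsing scheme.
detailed = ["elem_1", "elem_2", "hs_1", "hs_2", "college_1", "elem_1", "hs_2"]
collapse = {
    "elem_1": "elementary",
    "elem_2": "elementary",
    "hs_1": "high_school",
    "hs_2": "high_school",
    "college_1": "college",
}

broad = recode(detailed, collapse)
counts_before = Counter(detailed)  # five small categories
counts_after = Counter(broad)      # three larger categories
```

Comparing the two counts shows the trade-off described above: fewer categories give each class more observations, at the cost of grouping less similar units together.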

The matching variables employed in these methods were selected using tests for independence
and measures of association for nominal data. The Chi-Square test was applied to determine
whether each potential matching variable is significantly associated with the nonresponse
variable. If the potential matching variable was significant, the measures of association were
then computed. These were done in order to find the matching variable that would best divide the
data into imputation classes with the characteristics discussed earlier. Three measures of
association were used in this study: the Phi-coefficient, Cramer's V and the Contingency Test.
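
As an illustration of how these statistics relate to one another, the sketch below computes the Pearson chi-square statistic from a table of counts and then derives the three measures from it. It assumes the standard textbook definitions, and it assumes that the Contingency Test refers to Pearson's contingency coefficient; the exact formulas applied by the software in the study are not stated in the source. The counts are hypothetical, and the p-value, which requires the chi-square distribution, is left out.

```python
import math


def chi_square(table):
    """Return the Pearson chi-square statistic, degrees of freedom and sample size."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, n


def association_measures(stat, n, n_rows, n_cols):
    """Return phi, Cramer's V and the contingency coefficient for a chi-square statistic."""
    phi = math.sqrt(stat / n)
    cramers_v = math.sqrt(stat / (n * min(n_rows - 1, n_cols - 1)))
    contingency = math.sqrt(stat / (stat + n))
    return phi, cramers_v, contingency


# Hypothetical 3 x 3 cross tabulation of a matching variable by a nonresponse variable.
counts = [
    [20, 5, 5],
    [5, 20, 5],
    [5, 5, 20],
]
stat, df, n = chi_square(counts)
phi, cramers_v, contingency = association_measures(stat, n, len(counts), len(counts[0]))
```

Under these definitions all three measures are functions of the same chi-square statistic and the sample size, so they rank variables in a similar way but are scaled differently for tables larger than 2 x 2.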

To obtain results more quickly, the statistical packages SPSS and Statistica were used for
this part.

The results from the SPSS cross tabulation function are shown below:

Table #.1 Chi-Square Test for Independence

CHI-SQUARE TEST: INCOME VARIABLE
VARIABLES    STAT       P-VALUE     DF
PROVINCE     151.78     < 0.0001    9
CODES1       613.859    < 0.0001    6
CODEP1       358.436    < 0.0001    9

CHI-SQUARE TEST: EXPENDITURE VARIABLE
VARIABLES    STAT       P-VALUE     DF
PROVINCE     137.83     < 0.0001    9
CODES1       687.342    < 0.0001    6
CODEP1       193.132    < 0.0001    9

Results in Table #.1 showed that all matching variables were significantly associated with
their respective partial nonresponse variables.

Table #.2. Measures of Association

INCOME VARIABLE
                    PROVINCE    CODES1    CODEP1
Phi-Coefficient     0.192       0.386     0.295
Cramer's V          0.111       0.273     0.17
Contingency Test    0.188       0.36      0.283

EXPENDITURE VARIABLE
                    PROVINCE    CODES1    CODEP1
Phi-Coefficient     0.183       0.408     0.216
Cramer's V          0.105       0.288     0.125
Contingency Test    0.18        0.378     0.211

Results in Table #.2 showed that the newly categorized Education Status (CODES1) had the
highest values on all measures of association with its respective partial nonresponse variables.
However, none of the matching variables has a strong association with the nonresponse variables.
This was not treated as a serious concern, primarily because in real, complex data the variables
often have only a weak association, or sometimes none at all. The minimum association required
for this study is twenty percent (0.20) on all the tests. Only CODES1 reached at least twenty
percent on every measure of association with the partial nonresponse variables.
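
The screening rule can be reproduced from the values reported in Table #.2. The sketch below assumes that the twenty percent minimum is applied to every measure for both the income and the expenditure variable; the dictionary layout and the function name are illustrative only.

```python
THRESHOLD = 0.20  # minimum association stated in the text

# Values copied from Table #.2.
table_2 = {
    "income": {
        "PROVINCE": {"phi": 0.192, "cramers_v": 0.111, "contingency": 0.188},
        "CODES1": {"phi": 0.386, "cramers_v": 0.273, "contingency": 0.36},
        "CODEP1": {"phi": 0.295, "cramers_v": 0.17, "contingency": 0.283},
    },
    "expenditure": {
        "PROVINCE": {"phi": 0.183, "cramers_v": 0.105, "contingency": 0.18},
        "CODES1": {"phi": 0.408, "cramers_v": 0.288, "contingency": 0.378},
        "CODEP1": {"phi": 0.216, "cramers_v": 0.125, "contingency": 0.211},
    },
}


def passes_all(measures, threshold=THRESHOLD):
    """True when every measure of association reaches the threshold."""
    return all(value >= threshold for value in measures.values())


# Keep a matching variable only if it passes for both nonresponse variables.
selected = sorted(
    name
    for name in table_2["income"]
    if all(passes_all(table_2[target][name]) for target in table_2)
)
```

Under this reading of the rule, CODEP1 is excluded because its Cramer's V falls below 0.20 for both variables, even though its other two measures exceed the minimum.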
