
Chapter 3 Conceptual Framework

3.1

Definition of Terms:

Bias is defined as the difference between the expected value of an estimator and the true value of the parameter being estimated. The bias of an estimator or decision rule can be positive, negative or zero. An estimator with zero bias is said to be unbiased; an estimator with nonzero bias is said to be biased. The bias is expressed by:

$$ \mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta $$

where $\hat{\theta}$ is the estimator of the parameter and $\theta$ is the true value of the parameter.

Accuracy is a measure of how close the estimates are to the true value. Precision is a measure of how close the estimates are to one another. Efficiency is a measure of how well a job is accomplished, judged against a set of criteria, with minimum waste of time, effort or skill. Nonresponse is the failure to obtain a valid response from a unit in the survey.
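As a minimal illustration of the definition of bias, the following Python sketch enumerates every equally likely sample of size 2, drawn with replacement, from a small population and computes the bias of two estimators of the population mean. The population values and the function name `bias` are hypothetical and chosen only for illustration; they are not part of this study.

```python
from itertools import product
from statistics import mean

# Hypothetical population; the parameter being estimated is its mean.
population = [1, 2, 3, 4]
true_mean = mean(population)

# All 16 equally likely ordered samples of size 2 drawn with replacement.
samples = list(product(population, repeat=2))

def bias(estimator, true_value):
    """Bias = E[estimator] - true value, with E taken over all samples."""
    expected = mean(estimator(s) for s in samples)
    return expected - true_value

bias_of_mean = bias(mean, true_mean)  # sample mean as the estimator
bias_of_max = bias(max, true_mean)    # sample maximum as the estimator

print(bias_of_mean)
print(bias_of_max)
```

The sample mean has zero bias here, while the sample maximum overestimates the population mean on average, so its bias is positive.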

3.2

Types of Nonresponse

The types of nonresponse are distinguished by how much of a sampled unit's information is missing. Kalton (1983) stressed the importance of differentiating among the types of nonresponse: total (unit) nonresponse, item nonresponse, and partial nonresponse.

Unit (or total) nonresponse takes place when no information is collected from a sampling unit. Its causes include failure to contact the respondent (not at home, moved, or unit not found), refusal to participate, inability of the unit to cooperate (for example, due to illness or a language barrier), and lost questionnaires.

Item nonresponse, on the other hand, happens when the information collected from a unit is incomplete. Its causes include: failure to answer a question because the informant lacks the necessary information; failure to make the effort required to establish the information by retrieving it from memory or by consulting records; refusal to answer because the question is sensitive; failure of the interviewer to record an answer; and rejection of the response at an edit check on the grounds that it is inconsistent with other responses (which may include an inconsistency arising from a coding or punching error in the transfer of the response to the computer data file).

Partial nonresponse is the failure to collect large sets of items for a responding unit. It arises when a sampled unit fails to provide responses in one or more waves of a panel survey, in later phases of a multi-phase data collection procedure (e.g. the second visit of the FIES), or for later items in the questionnaire after breaking off a telephone interview. Other reasons include data that remain unavailable after all possible checking and follow-up, responses that fail the natural or reasonable constraints known as edits, so that one or more items are designated as unacceptable and are therefore artificially missing, and causes similar to those given for unit (total) nonresponse. In this study, the researchers dealt with partial nonresponse occurring in the second visit of the FIES 1997.

3.3.

Patterns of Nonresponse

A critical issue in addressing the problem of nonresponse is identifying the pattern of nonresponse. Determining the pattern is important because it influences how the missing data should be handled. There are three patterns of nonresponse, namely, Missing Completely at Random (MCAR), Missing at Random (MAR) and Nonignorable Nonresponse (NIN).

Missing data are said to be MCAR if the probability of having a missing value for Y is unrelated to the value of Y itself or to any other variable in the data set. Data that are MCAR reflect the highest degree of randomness and show no underlying reasons for missing observations that could potentially bias research findings (Musil, et al., 2002). Hence, the missing data are randomly distributed across all cases, such that the occurrence of missing data is independent of the other variables in the data set. An example of the MCAR pattern is when a sample unit in the survey fails to provide an answer to the total monthly expenditure because the unit cannot be reached.

Another pattern of nonresponse is the MAR case. Missing data are considered to be MAR if the probability of missing data on Y is unrelated to the value of Y after controlling for other variables in the analysis. This means that the likelihood of a case having incomplete information on a variable can be explained by other variables in the data set. An example of the MAR pattern is when a sampling unit fails to provide an answer to the total monthly expenditure because the sampling unit is male. The missing information about the total monthly expenditure then depends on the gender of the sampling unit and not on the total monthly expenditure itself.

Meanwhile, NIN is regarded as the most problematic nonresponse pattern. When the probability of missing data on Y is related to the value of Y, and possibly to some other variable Z, even after other variables are controlled in the analysis, the case is termed NIN. NIN missing data have systematic, nonrandom factors underlying the occurrence of the missing values that are not apparent or otherwise measured. NIN missing data are the most problematic because they limit the generalizability of research findings and may bias parameter estimates such as means, standard deviations, correlation coefficients or regression coefficients (Musil, et al., 2002). An example of the NIN pattern is when a sampling unit from the higher income groups fails to provide information even if the gender of the unit is controlled. Extending the example in the MAR pattern, suppose the sampling unit also failed to provide an answer because he was a high income earner. This is considered NIN since the nonresponse still depends on the income group even after the gender of the unit is controlled (Musil, et al., 2002).

These patterns serve as an important assumption before any imputation takes place. For an imputation procedure to work and achieve statistically acceptable and reliable estimates, the pattern of nonresponse must satisfy either the MCAR or the MAR assumption. For this study, the researchers created missing observations that satisfy the MCAR assumption.
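One way missing observations satisfying the MCAR assumption could be created is sketched below in Python. This is only an illustration under stated assumptions, not the procedure actually used by the researchers: the function name `make_mcar`, the nonresponse rate and the expenditure values are all hypothetical. The positions to be deleted are chosen by simple random sampling of indices, so the chance of being missing does not depend on the value itself or on any other variable.

```python
import random

def make_mcar(values, rate, seed=0):
    """Return a copy of `values` with round(rate * n) entries set to None.

    The positions are drawn by simple random sampling without looking at
    the data, which is what makes the missingness completely at random.
    """
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]

# Hypothetical total monthly expenditure values, for illustration only.
expenditure = [5200, 4100, 8900, 3000, 12000, 6700, 4500, 7300, 9800, 5600]
incomplete = make_mcar(expenditure, rate=0.3)
print(incomplete)
```

A MAR or NIN mechanism would instead let the selection probability depend on another variable or on the value of the variable itself.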

3.4

NR Bias

In most surveys, there is a strong tendency for the results of post-survey analysis to become invalid because of missing data. Missing data can be discarded, ignored or substituted through some procedure. When data are deleted or ignored in generating estimates, nonresponse (NR) bias becomes a problem (Kalton, 1983). The effect of deleting the missing data on NR bias is illustrated below:

Suppose the population is divided into two groups or strata: the first group consists of all units in the population for which measurements will be obtained if the units are included in the sample (respondents), and the second group consists of those units for which no measurement will be obtained (nonrespondents).

To arrive at the proper estimation of the nonresponse bias, the following quantities are defined:

Let R be the number of respondents and M (M stands for missing) be the number of nonrespondents in the population, with R + M = N. Assume that a simple random sample with replacement is drawn from each group. The corresponding sample quantities for the total number of respondents and nonrespondents are r and m respectively, with r + m = n.

Let $\bar{R} = R/N$ and $\bar{M} = M/N$ be the proportions of respondents and nonrespondents in the population, and let $\bar{r} = r/n$ and $\bar{m} = m/n$ be the proportions of respondents and nonrespondents in the sample. The population total and mean are given by $Y = Y_r + Y_m = R\bar{Y}_r + M\bar{Y}_m$ and $\bar{Y} = \bar{R}\bar{Y}_r + \bar{M}\bar{Y}_m$, where $Y_r$ and $\bar{Y}_r$ are the total and mean for respondents, respectively, and $Y_m$ and $\bar{Y}_m$ are the same quantities for the nonrespondents. The corresponding sample quantities are $y = y_r + y_m = r\bar{y}_r + m\bar{y}_m$ and $\bar{y} = \bar{r}\bar{y}_r + \bar{m}\bar{y}_m$ (Kalton, 1983).

If no compensation is made for nonresponse, the respondent sample mean $\bar{y}_r$ is used to estimate $\bar{Y}$. Its bias is given by $B(\bar{y}_r) = E[\bar{y}_r] - \bar{Y}$. The expectation of $\bar{y}_r$ can be obtained in two stages, first conditional on fixed r and then over different values of r, i.e. $E[\bar{y}_r] = E_1 E_2[\bar{y}_r]$, where $E_2$ is the conditional expectation for fixed r and $E_1$ is the expectation over different values of r. Thus,

$$ E[\bar{y}_r] = E_1 E_2[\bar{y}_r] = E_1[\bar{Y}_r] = \bar{Y}_r $$

Hence, the bias of $\bar{y}_r$ is given by

$$ B(\bar{y}_r) = \bar{Y}_r - \bar{Y} = \bar{Y}_r - (\bar{R}\bar{Y}_r + \bar{M}\bar{Y}_m) = \bar{M}(\bar{Y}_r - \bar{Y}_m) $$

The equation above shows that $\bar{y}_r$ is approximately unbiased for $\bar{Y}$ if either the proportion of nonrespondents $\bar{M}$ is small or the mean for nonrespondents, $\bar{Y}_m$, is close to that for respondents, $\bar{Y}_r$. Since the survey analyst usually has no direct empirical evidence on the magnitude of $(\bar{Y}_r - \bar{Y}_m)$, the only situation in which he can have confidence that the bias is small is when the nonresponse rate is low. However, in practice, even with moderate $\bar{M}$, many survey results escape sizable biases because $(\bar{Y}_r - \bar{Y}_m)$ is fortunately often not large (Kalton, 1983).
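The bias of the respondent mean equals the proportion of nonrespondents multiplied by the difference between the respondent and nonrespondent means, which follows from the definitions of the population quantities given above. The Python sketch below checks this numerically on a small made-up population. The data and the function name `nonresponse_bias` are hypothetical, and the nonrespondent values are treated as known only so that the bias can be computed.

```python
def nonresponse_bias(resp_values, nonresp_values):
    """Bias of the respondent mean as an estimator of the population mean,
    computed directly and through the product form."""
    R, M = len(resp_values), len(nonresp_values)
    N = R + M
    mean_r = sum(resp_values) / R       # respondent mean
    mean_m = sum(nonresp_values) / M    # nonrespondent mean
    pop_mean = (sum(resp_values) + sum(nonresp_values)) / N
    direct = mean_r - pop_mean                   # respondent mean minus population mean
    via_formula = (M / N) * (mean_r - mean_m)    # nonrespondent share times mean gap
    return direct, via_formula

# Hypothetical population: 8 respondents (mean 17), 2 nonrespondents (mean 32).
respondents = [10, 12, 14, 16, 18, 20, 22, 24]
nonrespondents = [30, 34]
direct, via_formula = nonresponse_bias(respondents, nonrespondents)
print(direct, via_formula)
```

In this made-up case the population mean is 20, so the respondent mean of 17 understates it by 3 even though only 20% of the units are nonrespondents, because the two groups differ considerably.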

To reduce the nonresponse bias caused by missing data, many procedures can be applied, and one of these is imputation. In this study, imputation procedures are applied to compensate for nonresponse and to reduce the bias of the estimates. Imputation is briefly defined as the substitution of values for the nonresponse observations.

3.5

Imputation Process

Imputation is one of the many procedures that can be used to deal with nonresponse to generate unbiased results. Imputation is the process of replacing a missing value, through available statistical and mathematical techniques, with a value that is considered to be a reasonable substitute for the missing information (Kalton, 1983).

Imputation has certain advantages. First, imputation methods help reduce biases in survey estimates. Second, imputation makes analysis easier and the results are simpler to present. Imputation does not make use of complex algorithms to estimate the population parameters in the presence of missing data; hence, much processing time is saved. Lastly, using imputation techniques can ensure consistency of results across analyses, a feature that an incomplete data set cannot fully provide.

On the other hand, imputation also has several disadvantages. There is no guarantee that the results obtained after applying imputation methods will be less biased than those based on the incomplete data set; the biases from the results using imputation could even be greater. Hence, the use of imputation methods depends on the suitability of the assumptions built into the imputation procedures used. Even if the biases of univariate statistics are reduced, there is no assurance that the distribution of the data and the relationships between variables will be preserved. More importantly, imputation is just a fabrication of data. Many naive researchers wrongly treat the imputed data as a complete data set for n respondents, as if it were a straightforward sample of size n.

There are four Imputation Methods (IMs) applied in this study, namely, the Overall (Grand) Mean Imputation, Hot Deck Imputation, Deterministic Regression Imputation and Stochastic Regression Imputation. For most imputation methods, imputation classes must be defined before the IMs can be performed.

Imputation classes are stratification classes that divide the data into groups before imputation takes place. The formation of imputation classes is very useful if the classes divide the data into homogeneous groups, that is, if units with similar characteristics have some propensity to provide the same response. The variables used to define imputation classes are called matching variables. The values substituted for the nonresponse observations are taken from a group of records that have a response on the variable. These records are called donors. The records with missing observations to be substituted are called recipients. Problems might arise for imputation methods that rely on imputation classes if the number of classes is not chosen with caution. The matching variable must have a definite number of classes applied to each method. The larger the number of imputation classes, the greater the possibility of having fewer observations in a class, which can cause the variance of the estimates under that class to increase. On the other hand, the smaller the number of imputation classes, the greater the possibility of having more observations in a class, which burdens the estimates with aggregation bias.

3.5.1

Overall Mean Imputation

The mean imputation method is the process by which missing data is imputed by the mean of the available units of the same imputation class to which it belongs (Cheng and Sy, 1999). One of the types of this method is the Overall Mean Imputation (OMI) method. The OMI is one of the widely used methods in imputing for missing data. The OMI method simply replaces each missing data by the overall mean of the available (responding) units in the same population. The overall mean is given by:

$$ \bar{y}_{omi} = \frac{\sum_{i=1}^{r} y_{ri}}{r} = \bar{y}_r $$

where $\bar{y}_{omi}$ is the mean of the responding units in the entire sample for the variable y, and $y_{ri}$ is the observation on y for the i-th responding unit.
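A minimal Python sketch of the OMI method is given below, assuming missing values are coded as `None`. The function name `overall_mean_impute` and the data are hypothetical. The sketch also compares the sample variance before and after imputation to show how replacing every missing value by a single number affects the spread of the data.

```python
from statistics import variance

def overall_mean_impute(values):
    """Replace every missing value (None) by the mean of the responding units."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no responding units to compute the mean from")
    y_omi = sum(observed) / len(observed)
    return [y_omi if v is None else v for v in values]

# Hypothetical data with two nonresponse observations.
y = [4.0, None, 6.0, 8.0, None, 2.0]
observed = [v for v in y if v is not None]
completed = overall_mean_impute(y)

print(completed)
print(variance(observed), variance(completed))
```

The mean of the completed data equals the respondent mean, but its sample variance is smaller than that of the responding units alone, since the imputed values add no variability.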

This method has both advantages and disadvantages. Its main advantage is its universality, which means that it can be applied to any data set. Moreover, this method does not require the use of imputation classes. Without imputation classes, the method becomes easier to use and results are generated faster.

However, this method has serious disadvantages. First, since missing values are imputed by a single value, the distribution of the data becomes distorted (see Figure 1). The distribution becomes too peaked, making it unsuitable for many post-analyses. Second, it produces large biases and variances because it does not allow variability in the imputation of missing values. Much of the related literature states that this method is the least effective and recommends against using it.

Figure 1 Distribution of the Data Before and After Imputation

3.5.2

Hot Deck Imputation

One of the most popular and widely known methods is the Hot Deck Imputation (HDI) method. The HDI method is the process by which the missing observations are imputed by choosing a value from the set of available units. This value is selected either at random (traditional hot deck), in some deterministic way with or without replacement (deterministic hot deck), or based on a measure of distance (nearest-neighbor hot deck). To perform this method, let Y be the variable that contains missing data and let X be a set of variables with no missing data. In imputing for the missing data:

1. Find a set of categorical X variables that are highly associated with Y. The X variables selected will be the matching variables in this imputation.
2. Form a contingency table based on the X variables.
3. If there are cases with missing Y within a particular cell of the table, select a case from the available units in that cell and impute its Y value to the missing value. The donor chosen and the case with the missing value must have similar or exactly the same characteristics.

Cheng and Sy (1999) stated that the HDI method produces more accurate estimates with the use of imputation classes. If the matching variables are closely associated with the variable being imputed, the bias should be reduced.

Example 1: Suppose that a survey is conducted with a sample of ten people. In the survey, three people refused to provide their Grade Point Average (GPA) for the previous term. The missing answer of each nonrespondent is replaced by a known value from a responding unit who has similar characteristics such as sex, degree or course (Course), Dean Lister (DL), Honor student in High School (HS2), and Hours of study classes (HSC). Suppose the set of X matching variables that are highly associated with GPA consists of the variables DL and HS2.

Table 1 shows the data with imputed values. Values in brackets are the imputed values that were randomly chosen in their respective imputation classes.

Table 1: Imputed values of GPA using the HDI

Person  Sex  DL  HS2  HSC  GPA*
1       M    Y   Y    2    [3.999]
2       F    Y   N    1    3.567
3       F    N   N    0    1.298
4       F    N   Y    0    2.781
5       M    N   Y    1    2.344
6       M    N   N    0    1.111
7       M    N   Y    1    [2.781]
8       F    Y   N    1    3.246
9       F    Y   N    1    [3.246]
10      F    Y   Y    1    3.999

*Values in brackets are imputed values
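The imputation in Example 1 can be sketched in Python as follows. This is a minimal random hot deck under stated assumptions: donors are drawn with replacement within the imputation classes formed by DL and HS2, missing GPAs are coded as `None`, and the function name `hot_deck_impute` and the seed are hypothetical. Because the donor is chosen at random, the imputed values for Persons 7 and 9 need not coincide with those shown in Table 1.

```python
import random
from collections import defaultdict

# Data of Example 1: (person, DL, HS2, GPA); None marks a missing GPA.
rows = [
    (1, "Y", "Y", None), (2, "Y", "N", 3.567), (3, "N", "N", 1.298),
    (4, "N", "Y", 2.781), (5, "N", "Y", 2.344), (6, "N", "N", 1.111),
    (7, "N", "Y", None), (8, "Y", "N", 3.246), (9, "Y", "N", None),
    (10, "Y", "Y", 3.999),
]
records = [dict(zip(("person", "DL", "HS2", "GPA"), r)) for r in rows]

def hot_deck_impute(records, match_vars, target, seed=0):
    """Impute missing `target` values with a randomly chosen donor value
    from the same imputation class (same values of the matching variables)."""
    rng = random.Random(seed)
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[tuple(rec[v] for v in match_vars)].append(rec[target])
    completed = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left unchanged
        if rec[target] is None:
            key = tuple(rec[v] for v in match_vars)
            if not donors[key]:
                raise ValueError(f"no donor in imputation class {key}")
            rec[target] = rng.choice(donors[key])  # selection with replacement
        completed.append(rec)
    return completed

completed = hot_deck_impute(records, match_vars=("DL", "HS2"), target="GPA")
for rec in completed:
    print(rec["person"], rec["GPA"])
```

Person 1 has only one possible donor (Person 10), while Persons 7 and 9 each have two donors in their classes, which is where the random selection matters.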

Like OMI, this method has certain advantages. One major attraction of this method, cited by Kazemi (2005), is that the imputed values are all actual values. More importantly, the shape of the distribution is preserved: since imputation classes are introduced, the chance of distorting the distribution decreases.

On the other hand, it also has a set of disadvantages. First, in order to form imputation classes, all X variables must be categorical. Second, if donors are selected with replacement, the possibility of generating a distorted data set increases as the nonresponse rate increases: observations from the same donor record might be used repeatedly for the missing values, causing the shape of the distribution to get distorted. Third, the number of imputation classes must be limited to ensure that every missing value has a donor in its class.

4.3

Regression Imputation

Like the mean imputation and HDI methods, this procedure is one of the widely used imputation methods. The method of imputing missing values via least-squares regression is known as the Regression Imputation (RI) method. There are many ways of constructing the regression used in imputing for the missing observations. The y-variable for which imputations are needed is regressed on the auxiliary variables (x1, x2, ..., xp) for the units providing a response on y. These auxiliary variables may be quantitative or qualitative, the latter being incorporated into the regression model by means of dummy variables. There are two basic types of the RI method: (a) Deterministic Regression Imputation and (b) Stochastic Regression Imputation.

In comparing the accuracy and efficiency of the RI methods, it is helpful if the methods being compared use the same imputation classes.

4.3.1

Deterministic Regression Imputation

Deterministic Regression Imputation (DRI) imputes, for a record with a missing response on the variable y, the value predicted by the regression model given that record's values of the auxiliary variables, which contain no missing data. This method is seen as a generalization of the mean imputation method. The model for DRI is given by:

$$ \hat{y}_k = \hat{\beta}_0 + \sum_{i=1}^{p-1} \hat{\beta}_i X_{ik} $$

where

$\hat{y}_k$ is the predicted value to be imputed for the k-th nonresponding unit,
$\hat{\beta}_0$ and $\hat{\beta}_i$ are the parameter estimates, and
$X_{ik}$ is the value of the i-th auxiliary variable, which can be either a quantitative variable or a dummy variable, for the k-th nonresponding unit.
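The DRI model can be sketched in Python for the simplest case of a single quantitative auxiliary variable. This is only an illustration: the function names `fit_simple_ols` and `deterministic_regression_impute` and the data are hypothetical, missing y-values are coded as `None`, and the restriction to one auxiliary variable is a simplification of the general model. The regression is fitted by least squares on the responding units only, and each missing y is replaced by its predicted value.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for one auxiliary variable."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def deterministic_regression_impute(x, y):
    """Replace each missing y (None) by the value predicted from x."""
    resp = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    b0, b1 = fit_simple_ols([xi for xi, _ in resp], [yi for _, yi in resp])
    return [b0 + b1 * xi if yi is None else yi for xi, yi in zip(x, y)]

# Hypothetical data: the respondents lie exactly on the line y = 1 + 2x.
x = [1, 2, 3, 4, 5, 6]
y = [3.0, 5.0, None, 9.0, None, 13.0]
completed = deterministic_regression_impute(x, y)
print(completed)
```

Every imputed value falls exactly on the fitted line, so the imputed records contribute no residual variation to the completed data.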

There are advantages and disadvantages of using DRI. DRI has the potential to produce imputed values close to the actual values of the nonresponse observations. For the method to be effective, that is, for the predicted value to be near the actual value, a high $R^2$ is needed. Although this method has the potential to produce closer imputed values, it is a time-consuming operation, and it is often unrealistic to consider applying it to all the items with missing values in a survey.

Using DRI can also underestimate the variance of the estimates and distort the distribution of the data. One major disadvantage of this method is that it can produce out-of-range or infeasible values (e.g. predicting a negative age).

4.3.2

Stochastic Regression Imputation

Imputing the predicted value from the model has undesirable distributional properties similar to those of the mean imputation method. To compensate, an estimated residual is added to the predicted value. The use of this predicted value plus some type of randomly chosen estimated residual is called the Stochastic Regression Imputation (SRI) method. The model for SRI is given by:

$$ \hat{y}_k = \hat{\beta}_0 + \sum_{i=1}^{p-1} \hat{\beta}_i X_{ik} + \hat{e}_k $$

where

$\hat{y}_k$ is the predicted value to be imputed for the k-th nonresponding unit,
$\hat{\beta}_0$ and $\hat{\beta}_i$ are the parameter estimates,
$X_{ik}$ is the value of the i-th auxiliary variable, which can be either a quantitative variable or a dummy variable, for the k-th nonresponding unit, and
$\hat{e}_k$ is the randomly chosen residual for the k-th nonresponding unit.

There are various ways in which this could be done depending on the assumptions made about the residuals. The following are some possibilities:

1. Assume that the errors are homoscedastic and normally distributed, $N(0, \sigma_e^2)$. Then $\sigma_e^2$ could be estimated by the residual variance from the regression, $s_e^2$, and the residual for a recipient could be chosen at random from $N(0, s_e^2)$.
2. Assume that the errors are heteroscedastic and normally distributed, with $\sigma_{ej}^2$ being the residual variance in some group j. Estimate $\sigma_{ej}^2$ by $s_{ej}^2$, and choose a residual for a recipient in group j from $N(0, s_{ej}^2)$.
3. Assume that the residuals all come from the same, unspecified distribution. Then estimate $y_k$ by $\hat{y}_k + \hat{e}_k$, where $\hat{e}_k$ is the estimated residual for a randomly chosen donor.
4. The assumption in (3) accepts the linearity and additivity of the model. If there are doubts about these assumptions, it may be better to take not a randomly chosen donor but instead one close to the recipient in terms of his x-values. In the limit, if a donor with the same set of x-values is found, this procedure reduces to assigning that donor's y-value to the recipient.
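Possibilities (1) and (3) above can be sketched in Python for a single quantitative auxiliary variable. This is a minimal illustration under stated assumptions: the function names, the `residual` argument, the seed and the data are hypothetical, missing y-values are coded as `None`, and the residual variance is estimated with n - 2 degrees of freedom because two parameters are fitted.

```python
import random

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for one auxiliary variable."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def stochastic_regression_impute(x, y, residual="normal", seed=0):
    """Impute each missing y by its predicted value plus a random residual.

    residual="normal": draw from N(0, s_e^2), as in possibility (1).
    residual="donor":  use the estimated residual of a randomly chosen
                       responding unit, as in possibility (3).
    """
    rng = random.Random(seed)
    resp = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    b0, b1 = fit_simple_ols([xi for xi, _ in resp], [yi for _, yi in resp])
    resid = [yi - (b0 + b1 * xi) for xi, yi in resp]
    # Residual standard deviation from the regression on the respondents.
    s_e = (sum(e * e for e in resid) / (len(resid) - 2)) ** 0.5
    completed = []
    for xi, yi in zip(x, y):
        if yi is None:
            if residual == "normal":
                e_k = rng.gauss(0.0, s_e)
            else:
                e_k = rng.choice(resid)
            yi = b0 + b1 * xi + e_k
        completed.append(yi)
    return completed, (b0, b1), resid

# Hypothetical data with two nonresponse observations.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.9, 5.2, None, 9.1, 10.8, None, 15.1, 16.9]
by_normal, _, _ = stochastic_regression_impute(x, y, residual="normal")
by_donor, (b0, b1), resid = stochastic_regression_impute(x, y, residual="donor")
print(by_normal)
print(by_donor)
```

Possibility (2) would follow the same pattern with a separate residual standard deviation estimated within each group j, and possibility (4) would replace the random choice of donor with the donor nearest to the recipient in terms of x.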

There are advantages and disadvantages in using SRI. Similar to DRI, this method can produce imputed values that are near the actual values of the nonresponse observations if the model has a high $R^2$. It is likewise a time-consuming operation, and it is often unrealistic to consider applying it to all the items with missing values in a survey. This method can also produce out-of-range values even when the predicted value without the added residual is in range: under SRI, adding the residual to a feasible deterministic imputation can yield an infeasible value.