
Imputation Procedures for Partial Nonresponse: The Case of 1997 Family Income and Expenditure Survey (FIES)

A Thesis Presented to The Faculty of the Mathematics Department College of Science De La Salle University - Manila

In Partial Fulfillment of the Requirements for the Degree Bachelor of Science in Statistics Major in Actuarial Science

by Diana Camille B. Cortes James Edison T. Pangan

August 2007

Approval Sheet
The thesis entitled Imputation Procedures for Partial Nonresponse: The Case of 1997 FIES, submitted by Diana Camille B. Cortes and James Edison T. Pangan, upon the recommendation of their adviser, has been accepted and approved in partial fulfillment of the requirements for the degree of Bachelor of Science in Statistics Major in Actuarial Science.

ARTURO Y. PACIFICADOR JR., Ph.D.
Thesis Adviser

PANEL OF EXAMINERS

RECHEL G. ARCILLA, Ph.D.
Chairperson

IMELDA E. de MESA
Member

MICHELE G. TAN
Member

Date of Oral Defense: August 25, 2007

Acknowledgments
The researchers would like to extend their warmest gratitude to the following people, who have undoubtedly contributed to the success of this study:
• To Dr. Jun Pacificador Jr., for his supervision, suggestions and guidance throughout this thesis.
• To Dr. Ederlina Nocon, for providing us with the LaTeX software during THSMTH1.
• To our parents, especially Jed's mother, Mrs. Erlinda Pangan, for constantly reminding the researchers about the thesis (i.e. "Tapos na ba ang thesis nyo?", or "Is your thesis done yet?").
• To Mark Nanquil and Norman Rodrigo, for helping us use LaTeX and for their unwavering support of our thesis.
• To our friends from COSCA, La Salle Debate Society and Math Circle, for their continuous encouragement and support.
• Lastly, to The Lord Almighty, for providing us the strength, patience, wisdom and determination to finish this thesis.

Table of Contents

Title Page
Approval Sheet
Acknowledgments
Table of Contents
Abstract

1 The Problem and Its Background
  1.1 Introduction
  1.2 Statement of the Problem
  1.3 Objectives of the Study
  1.4 Significance of the Study
  1.5 Scope and Limitations

2 Review of Related Literature

3 Conceptual Framework
  3.1 Nonresponse Bias
  3.2 Nonresponse and Its Patterns
  3.3 Types of Nonresponse
  3.4 The Imputation Procedures
    3.4.1 Overall Mean Imputation (OMI)
    3.4.2 Hot Deck Imputation (HDI)
    3.4.3 General Regression Imputation
          Stochastic Regression

4 Methodology
  4.1 Source of Data
    4.1.1 General Background
    4.1.2 Sampling Design and Coverage
    4.1.3 Survey Characteristics
    4.1.4 Survey Nonresponse
  4.2 The Simulation Method
  4.3 Formation of Imputation Classes
  4.4 Performing the Imputation Techniques
    4.4.1 Overall Mean Imputation (OMI)
    4.4.2 Hot Deck Imputation (HDI)
    4.4.3 Deterministic and Stochastic Regression Imputation (DRI) and (SRI)
  4.5 Comparison of Imputation Techniques
    4.5.1 The Bias and Variance of the Estimates
    4.5.2 Comparing the Distributions of the Imputed vs. the Actual Data
    4.5.3 Other Measures in Assessing the Performance of the Imputation Methods
    4.5.4 Determining the Best Imputation Method

5 Results and Discussion
  5.1 Descriptive Statistics of Second Visit Data Variables
  5.2 Formation of Imputation Classes
    5.2.1 Mean of the Simulated Data by Nonresponse Rate for Each Variables of Interest
    5.2.2 Regression Model Adequacy
    5.2.3 Evaluation of the Different Imputation Methods
          Overall Mean Imputation
    5.2.4 Hot Deck Imputation
    5.2.5 Deterministic Regression Imputation
    5.2.6 Stochastic Regression Imputation
  5.3 Choosing the Best Imputation

6 Conclusion

7 Recommendations for Further Research

Abstract
Several imputation methods have been developed for imputing missing responses. It is often not clear which imputation method is "best" for a particular set of assumptions. In choosing an imputation method, several factors should be considered, such as the types of estimates that will be generated, the type and pattern of nonresponse, and the availability of auxiliary data that are highly correlated with the characteristic of interest or with the response propensity.

This study compared the effectiveness of four imputation procedures, namely Overall Mean, Hot Deck, Deterministic Regression and Stochastic Regression Imputation, using the first-visit variable as the auxiliary variable. A total of 4,130 cases were used in the simulation. Values of the second-visit variables Total Income and Total Expenditures (TOTIN2 and TOTEX2) were set to nonresponse to satisfy the assumption of partial nonresponse. The results of the study provide some support for the following conclusions: (a) for the 1997 FIES data, the Hot Deck Imputation and Overall Mean Imputation methods are not appropriate for handling partial nonresponse data; (b) Stochastic Regression Imputation was selected as the best imputation method; and (c) the imputation classes must be homogeneous to produce less biased estimates.

Chapter 1

The Problem and Its Background
1.1 Introduction

Missing data in sample surveys is inevitable. The problem of missing data occurs for various reasons, such as when the respondent has moved to another location, refuses to participate in the survey or is unable to answer specific items in the survey. This failure to obtain responses from the units selected in the sample is called nonresponse. There are several types of nonresponse: (a) unit nonresponse refers to the failure to collect any data from a sample unit; (b) item nonresponse refers to the failure to collect valid responses to one or more items from a responding sample unit; and (c) partial nonresponse occurs when there is a failure to collect responses for large sets or a block of items for a responding unit (e.g. in surveys with two phases, the same respondent cannot answer in the second phase, hence the items for the second phase are missing).

In surveys that have more than one round of data collection, the problem of nonresponse becomes more complicated. In surveys of this type, it is quite possible that a unit responds to the first round of the survey but fails to answer in the succeeding rounds. Hence, partial nonresponse occurs.

The effect of nonresponse must not be ignored since it leads to biased estimates, and a large bias makes the estimates inaccurate. Bias due to nonresponse is believed to be a function of the nonresponse rate and the difference in characteristics between responding and nonresponding units. The larger the nonresponse rate or the wider the difference in characteristics between the responding and nonresponding units, the larger the bias.

In practice, there are three ways of handling missing data: discarding the missing values, applying weighting adjustments, or using imputation techniques. Discarding the missing values, otherwise known as the Available Case Method, excludes the nonresponse records when analyzing the variable of interest. The problem with this method is that it does not account for the difference in characteristics between the responding and nonresponding units. Hence, methods for compensating for missing data are applied. The first method is weighting adjustment, which matches nonrespondents to respondents in terms of the data available on nonrespondents and increases the weights of the matched respondents to account for the missing values. Typically, the weights of the respondents are multiplied by the inverse of the response rate. This is often applied for unit nonresponse. Imputation, on the other hand, is used by statisticians to account for nonresponse, usually in the case of item and partial nonresponse. In imputation, a missing value is replaced by a reasonable substitute for the missing information. Once nonresponse has been dealt with, whether by weighting adjustments or imputation, researchers can proceed with their data analysis.
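To make the weighting adjustment concrete, the following is a minimal sketch in which the base weights of respondents are inflated by the inverse of the weighted response rate within each adjustment class. The function name `adjust_weights`, the class labels and the weights are hypothetical illustrative values; this is not the procedure used by the NSO or in this study.

```python
from collections import defaultdict

def adjust_weights(units):
    """Inflate respondent weights by the inverse response rate of their class.

    Each unit is a dict with keys 'cls' (adjustment class), 'weight' (base
    weight) and 'responded' (bool). Nonrespondents receive a weight of 0.
    Assumes every class has at least one respondent.
    """
    total = defaultdict(float)       # sum of base weights, all sampled units
    responding = defaultdict(float)  # sum of base weights, respondents only
    for u in units:
        total[u["cls"]] += u["weight"]
        if u["responded"]:
            responding[u["cls"]] += u["weight"]

    adjusted = []
    for u in units:
        if u["responded"]:
            rate = responding[u["cls"]] / total[u["cls"]]  # weighted response rate
            adjusted.append(u["weight"] / rate)
        else:
            adjusted.append(0.0)
    return adjusted

# Made-up sample: class A has a 50% response rate, class B has 100%.
units = [
    {"cls": "A", "weight": 10.0, "responded": True},
    {"cls": "A", "weight": 10.0, "responded": False},
    {"cls": "B", "weight": 5.0, "responded": True},
    {"cls": "B", "weight": 5.0, "responded": True},
]
weights = adjust_weights(units)
```

In this sketch the adjusted weights still sum to the total of the base weights, so the respondents in each class stand in for the nonrespondents of the same class.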

The Family Income and Expenditure Survey (FIES) is an example of a survey which has more than one round of data collection. The FIES is a nationwide survey of households conducted every three years by the National Statistics Office (NSO), with two visits to each sample unit per survey period, in order to provide information on the country's income distribution, spending patterns and poverty incidence. Like any other survey, the FIES encounters the problem of missing data, particularly the problem of nonresponse during the second visit. Given the many uses of this survey, it is important to have precise estimates of the income and expenditure indicators.

With the 1997 FIES as the data set for this study, this paper will focus on dealing with partial nonresponse through the use of imputation techniques. It aims to examine the effects of imputed values in coming up with estimates for the missing data at various nonresponse rates. Furthermore, the study aims to determine which imputation technique is appropriate for the FIES data by applying some of the methods used in the study of the 1978 Research Panel Survey for the Income Survey Development Program (ISDP) reported in Compensating for Missing Survey Data by Kalton (1983).

1.2 Statement of the Problem

This paper attempts to answer the following questions:
1. Which imputation technique is the most appropriate for the FIES data?
2. How do varying nonresponse rates affect the results for each imputation method?

1.3 Objectives of the Study

The paper will attempt to achieve the following objectives:
1. To compare the imputation techniques, namely Overall Mean Imputation, Hot Deck Imputation, Deterministic Regression and Stochastic Regression, based on their efficiency and ability to recapture the deleted values, by generating the missing values in the FIES 1997 second visit data using the first visit data of the same survey.
2. To investigate the effect of varying rates of missing observations, particularly the effect of 10%, 20% and 30% nonresponse rates, on the precision of the estimates.

1.4 Significance of the Study

Nonresponse is a common problem in conducting surveys. The presence of nonresponse in surveys results in incomplete data, which could pose serious problems during data analysis, particularly in the generation of statistically reliable estimates. Imputation techniques make it possible to account for the difference between respondents and nonrespondents, which helps reduce the nonresponse bias in the survey estimates.

Since most statistical packages require the use of complete data before conducting any procedure for data analysis, the use of imputation techniques can ensure consistency of results across analyses, something that an incomplete data set cannot fully provide.

A news article by Obanil (2006) entitled Topmost Floor of the NSO Building gutted by Fire, posted at Manila Bulletin Online, reported that on October 3, 2006, around 1 Million Pesos worth of documents were destroyed by the fire. Given the importance of the documents kept by the NSO, such as the FIES, it is important to be able to devise methods of compensating for missing data.

In terms of statistical research, most countries in the developed world, such as the United States, Canada, the UK and the Netherlands, already employ imputation techniques in their respective national statistical offices. In a country such as the Philippines, where data collection is very difficult, especially in some regions like the National Capital Region (NCR), imputation can ease the problem of data collection and nonresponse. This can even put us on par with our counterparts in the developed world in terms of statistical research.

More importantly, given the great impact of this survey on the country, employing imputation techniques gives statisticians a method for handling nonresponse, which could lead to more meaningful generalizations about our country's income distribution, spending patterns and poverty incidence. Estimates with less bias and more consistent results can in turn help our policymakers and economists provide better solutions for improving the lives of Filipinos.

1.5 Scope and Limitations

Throughout this paper, only the 1997 Family Income and Expenditure Survey (FIES) will be used to tackle the problem of nonresponse and to examine the impact of the different imputation methods applied to the dataset. Other methods of handling nonresponse will not be covered in this paper. With regard to the extent to which these imputation methods will be applied and evaluated, this paper will only cover the partial nonresponse occurring in the National Capital Region (NCR), since NCR is noted as the region with the highest nonresponse rate. Also, the variables to be imputed in this study are the Total Income (TOTIN2) and Total Expenditures (TOTEX2) in the second visit of the FIES data.

The researchers will only focus on using the 1997 FIES first visit data to impute the partial nonresponse present in the second visit. This paper also assumes that the first visit data is complete and that the pattern of nonresponse follows the Missing Completely at Random (MCAR) case. The MCAR case holds if the probability of response to Y is unrelated to the value of Y or to any other variable, making the missing data randomly distributed across all cases (Musil et al., 2002). If the pattern of nonresponse does not satisfy the MCAR assumption, imputation techniques may not achieve their purpose.

As for the imputation techniques, only four imputation methods will be applied for this paper namely: Overall Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Regression Imputation (DRI) and Stochastic Regression Imputation (SRI).
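As a rough illustration of how the four methods differ, the sketch below fills in missing second-visit values using a single first-visit auxiliary variable. It rests on several assumptions that are not taken from this study: the helper names `fit_line` and `impute` are hypothetical, the data are made up, no imputation classes are formed, the hot deck donor is chosen at random from all respondents, and the stochastic residual is drawn from the respondents' own regression residuals.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residuals)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, resid

def impute(x, y, method, rng):
    """Fill the None entries of y; x is the fully observed auxiliary variable."""
    obs = [i for i, v in enumerate(y) if v is not None]
    x_obs = [x[i] for i in obs]
    y_obs = [y[i] for i in obs]
    a, b, resid = fit_line(x_obs, y_obs)
    mean_obs = statistics.fmean(y_obs)
    out = list(y)
    for i, v in enumerate(y):
        if v is not None:
            continue                  # observed values are left untouched
        if method == "OMI":           # overall mean of the respondents
            out[i] = mean_obs
        elif method == "HDI":         # value of a randomly chosen donor
            out[i] = rng.choice(y_obs)
        elif method == "DRI":         # regression prediction only
            out[i] = a + b * x[i]
        elif method == "SRI":         # prediction plus a random respondent residual
            out[i] = a + b * x[i] + rng.choice(resid)
        else:
            raise ValueError(f"unknown method: {method}")
    return out

# Made-up first-visit and second-visit values; None marks nonresponse.
rng = random.Random(0)
first_visit = [10.0, 12.0, 15.0, 20.0, 25.0, 30.0]
second_visit = [11.0, None, 15.0, 22.0, None, 31.0]
results = {m: impute(first_visit, second_visit, m, rng)
           for m in ("OMI", "HDI", "DRI", "SRI")}
```

OMI gives every nonrespondent the same value, while HDI and SRI restore some of the variability of the observed data by adding a random component.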

The evaluation of the efficacy and appropriateness of the four imputation methods will be limited to the following: (a) the nonresponse bias and variances of the imputed data; (b) an assessment of the distributions of the imputed vs. the actual data; and (c) the criteria mentioned in the report entitled Compensating for Missing Survey Data (Kalton, 1983), namely the Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation.
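The three deviation criteria can be sketched as follows, assuming each deviation is taken as the imputed value minus the actual (deleted) value, so that a negative Mean Deviation indicates underestimation. The function name `deviation_measures` and the numbers are hypothetical.

```python
import math

def deviation_measures(imputed, actual):
    """Return (mean deviation, mean absolute deviation, root mean square deviation)."""
    diffs = [yhat - y for yhat, y in zip(imputed, actual)]  # imputed minus actual
    n = len(diffs)
    md = sum(diffs) / n
    mad = sum(abs(d) for d in diffs) / n
    rmsd = math.sqrt(sum(d * d for d in diffs) / n)
    return md, mad, rmsd

# Made-up imputed and actual values for three deleted records.
md, mad, rmsd = deviation_measures([10.0, 14.0, 9.0], [12.0, 13.0, 9.0])
```

The Mean Deviation lets positive and negative errors cancel and so reflects bias, while the other two measures reflect how closely individual deleted values are reconstructed.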

Chapter 2

Review of Related Literature
Much research effort has been devoted to the efficacy of various imputation methods. In the report entitled Compensating for Missing Survey Data, the author carried out two simulation studies using the data in the 1978 Income Survey Development Program (ISDP) Research Panel to compare some imputation methods. The first study compared imputation methods for the variable Hourly Rate of Pay, while the second dealt with the imputation of the variable Quarterly Earnings. For both studies, the author stratified the data into imputation classes, constructed data sets with missing values by randomly deleting some of the recorded values in the original dataset and then applied the various imputation methods to fill in the missing values. This process was replicated ten times to ensure consistency of the results. Once the imputation methods had been applied, the three measures for evaluating the effectiveness of imputation methods, namely the Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation, were obtained and averaged across the ten trials. (Kalton, 1983)
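The general shape of such a simulation (random deletion, imputation within classes, and averaging a criterion over replications) can be sketched as below. This is only an illustration under stated assumptions: class mean imputation stands in for the full set of methods, the function names, class labels and data are hypothetical, and only the Mean Deviation is averaged.

```python
import random
import statistics

def class_mean_impute(values, classes, missing):
    """Replace values at the indices in `missing` by the respondent mean of their class.

    Assumes every class keeps at least one respondent.
    """
    by_class = {}
    for i, (v, c) in enumerate(zip(values, classes)):
        if i not in missing:
            by_class.setdefault(c, []).append(v)
    return [statistics.fmean(by_class[c]) if i in missing else v
            for i, (v, c) in enumerate(zip(values, classes))]

def simulate(values, classes, rate, trials, seed=0):
    """Average the mean deviation (imputed minus actual) over repeated random deletions."""
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    mean_devs = []
    for _ in range(trials):
        missing = set(rng.sample(range(len(values)), n_missing))
        imputed = class_mean_impute(values, classes, missing)
        devs = [imputed[i] - values[i] for i in missing]
        mean_devs.append(statistics.fmean(devs))
    return statistics.fmean(mean_devs)

# Made-up data: 20 records in two imputation classes, 20% deleted, ten trials.
values = [float(v) for v in range(1, 21)]
classes = ["low"] * 10 + ["high"] * 10
avg_md = simulate(values, classes, rate=0.2, trials=10)
```

Fixing the seed makes the ten deletion patterns reproducible, so different imputation methods could be compared on exactly the same sets of deleted records.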

For the first study, on imputing the variable Hourly Rate of Pay, eight methods were used, namely Grand Mean Imputation (GM), Class Mean Imputation using eight imputation classes (CM8), Class Mean Imputation using ten imputation classes (CM10), Random Imputation with eight imputation classes (RC8), Random Imputation with ten imputation classes (RC10), Multiple Regression Imputation (MI), Multiple Regression Imputation plus a random residual chosen from a normal distribution (MN) and Multiple Regression Imputation plus a randomly chosen respondent residual (MR). Using the Mean Deviation criterion, the results showed that all mean deviations were negative, indicating that the imputed values underestimated the actual values. Moreover, the Grand Mean Imputation (GM) showed the greatest underestimation among the eight procedures. For the Mean Absolute Deviation and Root Mean Square Deviation, which measure the ability to reconstruct the deleted value, the Grand Mean Imputation fared the worst on both criteria. In addition, the Multiple Regression Imputation (MI) obtained the best measures on the two criteria, and the procedures with a greater number of imputation classes (i.e., CM10 vs. CM8, RC10 vs. RC8) yielded slightly better results on the two criteria. (Kalton, 1983)

For the second study, on the imputation of Quarterly Earnings, ten imputation procedures were used. These are Grand Mean Imputation (GM), Class Mean Imputation using eight imputation classes (CM8), Class Mean Imputation using twelve imputation classes (CM12), Random Imputation with eight imputation classes (RC8), Random Imputation with twelve imputation classes (RC12), Multiple Regression Imputation (MI), Multiple Regression Imputation plus a random residual chosen from a normal distribution (MN), Multiple Regression Imputation plus a randomly chosen respondent residual (MR), Mixed Deductive and Random Imputation using eight imputation classes (DI8) and Mixed Deductive and Random Imputation using twelve imputation classes (DI12). Using the first criterion, the Mean Deviation, the results showed that the Grand Mean (GM) obtained a positive bias. This implied that grand mean imputation is not an effective imputation method for this study. The results also showed that the regression imputation procedures gave almost identical results, producing almost unbiased estimates. In addition, the Class Mean Imputation methods (CM8 and CM12) have measures similar to those of the Random Imputation methods. Nevertheless, all methods produced relatively small mean deviations except for the last two methods (DI8 and DI12). Comparing the Mean Absolute Deviations and the Root Mean Square Deviations, the results show that the Grand Mean Imputation obtained values similar to those of the regression procedures with residuals (i.e. MN and MR). The results also show that the values for the RC8, RC12, MN and MR procedures are over one third larger than those for deterministic procedures such as CM8, CM12 and MI. (Kalton, 1983)

To further investigate the relatively larger biases of the DI8 and DI12 procedures, the author divided the data into the deductive and non-deductive cases. This shed further light on the Mean Deviations and Mean Absolute Deviations of the various imputation methods. It was found that the mean deviations are positive in the deductive case and negative in the non-deductive case for all of the procedures. This explains why there are relatively small deviations in the previous results, since the measures for the two cases tend to cancel out. It also showed that the DI8 and DI12 results are similar to those of RC8, RC12, CM8 and CM12 in the non-deductive cases but are largely different in the deductive cases. This explains the larger values for DI8 and DI12 in the previous results. (Kalton, 1983)

At the end of the two studies, the results showed that the imputation procedures tend to overestimate the Hourly Rate of Pay and underestimate the Quarterly Earnings. Moreover, mean imputation appeared to be the weakest imputation method in both studies since it distorted the distribution of the original data. Lastly, Kalton's study shows that increasing the number of imputation classes yields better values for the three criteria.

In contrast to Kalton's criteria for measuring the performance of imputation procedures, the paper entitled A Comparison of Imputation Techniques for Missing Data by C. Musil, C. Warner, P. Yobas and S. Jones presented a much simpler approach to evaluating the performance of imputation techniques. The authors used the means, standard deviations and correlation coefficients, comparing the statistics of the original data with the statistics obtained from five methods, namely Listwise Deletion, Mean Imputation, Deterministic Regression, Stochastic Regression and the EM Method. The Expectation Maximization (EM) Method is an iterative procedure that generates missing values by using expectation (E-step) and maximization (M-step) algorithms. The E-step calculates expected values based on all complete data points, while the M-step replaces the missing values with the E-step generated values and then recomputes new expected values. (Musil, Warner, Yobas and Jones, 2002)

Using the Center for Epidemiological Studies data on stress and health ratings of older adults, the authors imputed a single variable, namely the functional health rating. Of the 492 cases, 20% were deleted in an effort to maximize the effects of each imputation method. Except for Listwise Deletion and Mean Imputation, the researchers used the SPSS Missing Value Analysis function, which covered Deterministic Regression, Stochastic Regression and the EM Method. For the correlations, the researchers obtained the correlation of the imputed variable, under the original data and each of the five methods, with the variables age, gender and self-assessed health rating. (Musil, Warner, Yobas and Jones, 2002)

Comparing the mean of the original data with the five methods, the results show that all imputed values underestimated the mean. The closest to the original data was Stochastic Regression, followed very closely by the EM Method, Deterministic Regression, Listwise Deletion and Mean Imputation. The same ordering also holds for the standard deviations. For the correlations, however, the EM Method produced the closest correlation values to the original data, followed closely by Stochastic Regression, Deterministic Regression, Listwise Deletion and Mean Imputation. Hence, the findings suggest that Stochastic Regression and the EM Method performed better while Mean Imputation is the least effective. (Musil, Warner, Yobas and Jones, 2002)

In another study by Nordholt entitled Imputation Methods, Simulation, Experiments and Practical Examples, the author described two simulation experiments on the Hot Deck Method. The first study examined whether the Hot Deck Method performs better than leaving the records with nonresponse out of the data set when analyzing the variable, which is known as the Available Case Method. This was done by constructing a fictitious data set of four variables; two of these variables were used for the imputation. Nonresponse rates of 5%, 10% and 20% were then set, and the simulation process was replicated 50 times. The data set containing the missing values was first analyzed using the Available Case Method and then using Hot Deck Imputation. As in the methodology of Musil et al., descriptive statistics such as the mean, variance and correlation were computed. Moreover, the absolute differences between the original data and the Available Case Method, and between the original data and the Hot Deck Method, were computed. Based on these criteria, the results show that the Hot Deck performs better than the Available Case Method. The Hot Deck, while giving results closer to the original data, also has a tendency to underestimate the values. In terms of the absolute differences, it was observed that these values increase as the percentage of missing values increases. (Nordholt, 1998)

Nordholt's second simulation study focused on the effects of covariates, otherwise known as imputation classes, on the quality of the Hot Deck Imputation. Using the data of the Dutch Housing Demand Survey of Statistics Netherlands, the variable value of the house was chosen as the variable to be imputed due to its importance and the frequency of nonresponse occurring in that variable. For this study, the observations under category 13 (value worth at least 150,000) and category 22 (value worth at 300,000) were changed into missing values. The rationale for this choice was to ensure that the original values from these categories would not be used as replacements for the variable to be imputed, since they were no longer in the file. Imputation classes were then created once the missing values had been identified. A table showing the number of respondents before and after imputation showed that in every category except 13 and 22, which were set to missing, the number of respondents increased after the imputation. This showed that the remaining records have an equal probability of becoming a donor record for an imputation and that not all imputations give values that are near category 13 or 22. Nordholt also compared the Available Case Method and the Hot Deck Method on this real-life data. As in the first study, the Hot Deck fared better than the Available Case Method. (Nordholt, 1998)

Lastly, Nordholt addressed several questions regarding imputation. Using examples of how imputation is applied in real-life surveys such as the Dutch Housing Demand Survey, the European Community Household Panel Survey (ECHP) and the Dutch Structure of Earning Survey, he outlined four criteria for deciding which variables are to be imputed. These are the importance of a variable, the percentage of nonresponse, the predictability of missing values and the cost of imputation. He also mentioned the importance of estimating the duration of the imputation process, since the study needs to be timely. The duration, according to Nordholt, depends on the number of variables to be imputed, the available capacity, the user friendliness of an imputation package and the desired imputation quality. These issues must be settled first before conducting any imputation process and choosing the appropriate imputation strategy. (Nordholt, 1998)

Two undergraduate theses conducted similar studies on imputation. The first was by Salvino and Yu. They assessed the efficiency of the Mean Imputation versus the Hot Deck Imputation technique by applying these techniques to the 1991 Census on Agriculture and Fisheries (CAF) data. In their research, they generated an incomplete data set using the Gauss software for the imputed variables, which were the counts of cattle, hogs and chicken. To determine which of the two is better, the variances were compared. Based on the variances, it was determined that the Hot Deck Imputation technique was better. The design effect was also considered by dividing the variance under Hot Deck Imputation by the variance under Mean Imputation. Since the ratio was less than one, they again concluded that the Hot Deck Imputation technique is the better option. (Salvino and Yu, 1996)

Another undergraduate thesis, by Cheng and Sy, focused on assessing imputation techniques on clinical data. The authors employed four methods of imputation, namely Mean Imputation, Hot Deck Imputation, Linear Regression and Multiple Linear Regression. They assessed the efficacy of the imputation techniques by looking at the accuracy and precision of the estimates. Accuracy was measured by the percentage error, and the variance of these percentage errors was the basis for the precision of the estimates. The results show that Linear Regression was the best method, followed closely by Multiple Regression, then Hot Deck and finally Mean Imputation. (Cheng and Sy, 1999)

Chapter 3

Conceptual Framework
3.1 Nonresponse Bias

In most surveys, missing data can easily invalidate the results of the analysis. Missing data can be discarded, ignored or substituted through some procedure. When missing data are deleted or ignored in generating estimates, nonresponse bias becomes a problem (Kalton, 1983). This section examines the nonresponse bias that results from discarding the missing data and using only the data from the responding units in the survey analysis.

To make the concept of nonresponse bias easier to understand, this section deals only with nonresponse in general and does not discuss the types and patterns of nonresponse, which are covered in later sections of this chapter.

Consider a Simple Random Sample (SRS) in the variable y, where y contains missing data, from a population of size N is drawn. The population will then be assumed that it can be divided in two groups, the first group being the respon-

18 dents and the other one being the nonrespondents.

Let R be the number of respondents and M (M stands for missing) be the number of nonrespondents in the population, with R + M = N ; the corresponding sample ¯ quantities are ( r) and ( m), with r + m = n. Let R =
R N

¯ and M =

M N

be the
r n

proportions of respondents and nonrespondents in the population and let r = ¯ and m = ¯
m n

be the response and nonresponse rates in the sample. The population

¯ ¯ ¯ ¯¯ ¯¯ total and mean are given by Y = Yr + Ym = RYr + M Ym and Y = RYr + M Ym , ¯ ¯ where Yr and Yr are the total and mean for respondents and Ym and Ym are the same quantities for the nonrespondents. The corresponding sample quantities are y = yr + ym = r¯r and y = ryr + m¯m .(Kalton, 1983) y ¯ ¯¯ ¯y

If no compensation is made for nonresponse, the respondent sample mean $\bar{y}_r$ is used to estimate $\bar{Y}$. Its bias is given by $B(\bar{y}_r) = E(\bar{y}_r) - \bar{Y}$. The expectation of $\bar{y}_r$ can be obtained in two stages, first conditional on fixed $r$ and then over different values of $r$, i.e. $E(\bar{y}_r) = E_1 E_2(\bar{y}_r)$, where $E_2$ is the conditional expectation for fixed $r$ and $E_1$ is the expectation over different values of $r$. Thus,

$$E(\bar{y}_r) = E_1\left[\frac{\sum_{i=1}^{r} E_2(y_{ri})}{r}\right] = E_1(\bar{Y}_r) = \bar{Y}_r.$$

Hence, the bias of $\bar{y}_r$ is given by

$$B(\bar{y}_r) = \bar{Y}_r - \bar{Y} = \bar{M}(\bar{Y}_r - \bar{Y}_m).$$

The equation above shows that $\bar{y}_r$ is approximately unbiased for $\bar{Y}$ if either the proportion of nonrespondents $\bar{M}$ is small or the mean for nonrespondents, $\bar{Y}_m$, is close to that for respondents, $\bar{Y}_r$. Since the survey analyst usually has no direct empirical evidence on the magnitude of $(\bar{Y}_r - \bar{Y}_m)$, the only situation in which he can have confidence that the bias is small is when the nonresponse rate is low. However, in practice, even with moderate $\bar{M}$ many survey results escape sizable biases because $(\bar{Y}_r - \bar{Y}_m)$ is fortunately often not large (Kalton, 1983).
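The bias identity above can be checked numerically. The following is a minimal sketch on a small hypothetical population (the values and variable names are illustrative assumptions, not FIES data): it computes the bias of the respondent mean directly and through the formula $\bar{M}(\bar{Y}_r - \bar{Y}_m)$.

```python
from statistics import mean

# Hypothetical population values, split into respondents and nonrespondents.
respondents = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.0, 15.0]
nonrespondents = [20.0, 22.0]

N = len(respondents) + len(nonrespondents)
M_bar = len(nonrespondents) / N               # population proportion of nonrespondents
Y_bar_r = mean(respondents)                   # respondent mean
Y_bar_m = mean(nonrespondents)                # nonrespondent mean
Y_bar = mean(respondents + nonrespondents)    # overall population mean

bias_direct = Y_bar_r - Y_bar                 # B = Y_bar_r - Y_bar
bias_formula = M_bar * (Y_bar_r - Y_bar_m)    # B = M_bar * (Y_bar_r - Y_bar_m)
print(bias_direct, bias_formula)
```

In this sketch the two quantities agree up to floating-point rounding, and the bias shrinks toward zero as either the nonresponse proportion or the gap between the two group means shrinks.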

Many procedures can be applied to reduce the nonresponse bias caused by missing data, and one of these procedures is imputation. In this study, imputation procedures are applied to fill in the missing observations and reduce the bias of the estimates. Imputation is briefly defined as the substitution of values for the nonresponse observations. The imputation procedures are discussed in the later portions of this chapter.

3.2 Nonresponse and Its Patterns

This section gives a more in-depth explanation of nonresponse and its patterns. It also presents the rationale for identifying the nonresponse pattern before choosing procedures to address the problem of missing data.

A critical issue in addressing the problem of nonresponse is identifying the pattern of nonresponse. Determining the pattern of nonresponse is important because it influences how missing data should be handled. There are three patterns of nonresponse, namely Missing Completely At Random, Missing At Random and Non Ignorable Nonresponse. Missing data are said to be Missing Completely At Random (MCAR) if the probability of having a missing value for Y is unrelated to the value of Y itself or to any other variable in the data set. Data that are MCAR reflect the highest degree of randomness and show no underlying reasons for missing observations that can potentially lead to biased research findings (Musil, Warner, Yobas and Jones, 2002). Hence, the missing data are randomly distributed across all cases such that the occurrence of missing data is independent of the other variables in the data set.

Another pattern of nonresponse is the Missing At Random (MAR) case. The missing data is considered to be MAR if the probability of missing data on Y is unrelated to the value of Y after controlling for other variables in the analysis. This means that the likelihood of a case having incomplete information on a variable can be explained by other variables in the data set.

Meanwhile, Non Ignorable Nonresponse (NIN) is regarded as the most problematic nonresponse pattern. When the probability of missing data on Y is related to the value of Y, and possibly to some other variable Z, even if other variables are controlled in the analysis, the case is termed Non Ignorable Nonresponse. NIN missing data have systematic, nonrandom factors underlying the occurrence of the missing values that are not apparent or otherwise measured. NIN missing data are the most problematic because they limit the generalizability of research findings and may create biased parameter estimates, such as means, standard deviations, correlation coefficients or regression coefficients (Musil, Warner, Yobas and Jones, 2002).

The pattern of nonresponse is an important assumption to consider before any imputation takes place. For an imputation procedure to work and achieve statistically acceptable and reliable estimates, the pattern of nonresponse must satisfy either the MCAR or the MAR assumption. For this study, the researchers created missing observations that satisfy the MCAR assumption.
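The difference between the three patterns can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the data, the 0/1 covariate `x`, the function name `make_missing` and the particular probabilities are all hypothetical and are not taken from the FIES or from the cited authors.

```python
import random
from statistics import median

def make_missing(y, x, pattern, rate=0.2, seed=0):
    """Return a copy of y in which None marks simulated nonresponse.

    pattern is "MCAR", "MAR" or "NIN"; x is a fully observed 0/1 covariate.
    """
    rng = random.Random(seed)
    y_median = median(y)
    result = []
    for yi, xi in zip(y, x):
        if pattern == "MCAR":
            p = rate                                    # same chance for every unit
        elif pattern == "MAR":
            p = 1.5 * rate if xi == 1 else 0.5 * rate   # depends on observed x only
        elif pattern == "NIN":
            p = 1.5 * rate if yi > y_median else 0.5 * rate  # depends on y itself
        else:
            raise ValueError("unknown pattern: " + pattern)
        result.append(None if rng.random() < p else yi)
    return result

# Hypothetical values of Y and a hypothetical 0/1 covariate.
y_values = [120, 95, 300, 210, 80, 150, 400, 175, 60, 250]
x_values = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1]
for pattern in ("MCAR", "MAR", "NIN"):
    print(pattern, make_missing(y_values, x_values, pattern))
```

Under MCAR every unit has the same chance of being missing; under MAR the chance varies only with the observed covariate; under NIN it varies with the very value that goes missing, which is why the respondents alone cannot reveal the mechanism.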

3.3 Types of Nonresponse

Another important issue in dealing with missing data is the type of nonresponse. While the patterns of nonresponse focus on the relationship of the variable with nonresponse to other variables, the types of nonresponse focus on the manner in which the observations came to be missing. Kalton (1983) stressed the importance of differentiating the types of nonresponse: noncoverage, total (unit) nonresponse, item nonresponse and partial nonresponse.

Noncoverage (NC) denotes the failure to include some units of the survey population in the sampling frame. As a consequence, units that are excluded from the frame have no chance of appearing in the sample. NC is not usually regarded as a type of nonresponse; however, Kalton (1983) loosely classifies it as one for convenience. NC can be seen in surveys where the sampling frame fails to cover some units or the listing of units is incomplete.


Unit (or total) nonresponse takes place when no information is collected from a sampling unit. There are many causes of this nonresponse, namely, the failure to contact the respondent (not at home, moved or unit not being found), refusal to give information, inability of the unit to cooperate (which might be due to an illness or a language barrier) or questionnaires that are lost.

Item nonresponse, on the other hand, happens when the information collected from a unit is incomplete because some of the questions were not answered. There are many causes of item nonresponse: the informant lacks the information needed to answer the question; the informant fails to make the effort required to establish the information by retrieving it from memory or by consulting records; the informant refuses to answer because the question is sensitive, embarrassing or considered irrelevant to his perception of the survey's objectives; the interviewer fails to record an answer; or the response is subsequently rejected at an edit check on the grounds that it is inconsistent with other responses (this may include an inconsistency arising from a coding or punching error occurring in the transfer of the response to the computer data file).

Lastly, partial nonresponse is the failure to collect large sets of items for a responding unit. A sampled unit fails to provide responses in one or more waves of a panel survey, in later phases of a multi-phase data collection procedure (e.g. the second visit of the FIES), or for later items in the questionnaire after breaking off a telephone interview. Other reasons include data that remain unavailable after all possible checking and follow-up, responses that fail to satisfy natural or reasonable constraints known as edits, so that one or more items are designated as unacceptable and are therefore artificially missing, and causes similar to those of unit (total) nonresponse. In this study, the researchers dealt with partial nonresponse occurring in the second visit of the 1997 FIES.

3.4 The Imputation Procedures

Earlier, imputation was listed as one of the many procedures that can be used to deal with nonresponse in order to generate less biased results. Kalton (1983) defined imputation as the process of replacing a missing value, through available statistical and mathematical techniques, with a value that is considered to be a reasonable substitute for the missing information.

Imputation has certain advantages. First, utilizing imputation methods helps reduce biases in survey estimates. Second, imputation makes analysis easier and the results simpler to present. Imputation does not make use of complex algorithms to estimate the population parameters in the presence of missing data; hence, much processing time is saved. Lastly, using imputation techniques can ensure consistency of results across analyses, a feature that an incomplete data set cannot fully provide.

On the other hand, imputation also has several disadvantages. There is no guarantee that the results obtained after applying imputation methods will be less biased than those based on the incomplete data set. There is a possibility that the biases from the results using imputation could be greater. Hence, the use of imputation methods depends on the suitability of the assumptions built into the imputation procedures used. Even if the biases of univariate statistics are reduced, there is no assurance that the distribution of the data and the relationships between variables will be preserved. More importantly, imputation is just a fabrication of data. Many naive researchers falsely treat the imputed data as a complete data set for n respondents, as if it were a straightforward sample of size n.

Given that imputed values are substituted for missing responses, there are a variety of methods in which the imputed value may be determined. These methods are called Imputation Procedures or Methods. Imputation Methods are techniques applied to replace missing values. These techniques can either implement statistical or simply mathematical procedures like replacing an observation by a constant value (e.g. mean).

Four imputation methods are applied in this study, namely, the Overall (Grand) Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Regression Imputation (DRI) and Stochastic Regression Imputation (SRI). For most imputation methods, imputation classes need to be defined before the method can be performed.

Imputation classes are stratification classes that divide the data into groups before imputation takes place. The formation of imputation classes is very useful if the classes are homogeneous groups, that is, groups of units with similar characteristics that have some propensity to provide the same response. The variables used to define imputation classes are called matching variables. The values substituted for the nonresponse observations are taken from a group of records that have a response on the variable. These records are called donors. The records whose missing observations are to be filled in are called recipients.

Problems might arise if imputation classes are not formed with caution. One of them concerns the number of imputation classes, which must be fixed before each method is applied. The larger the number of imputation classes, the greater the possibility of having few observations in a class. This can cause the variance of the estimates under that class to increase. On the other hand, the smaller the number of imputation classes, the greater the possibility of having many observations in a class, which burdens the estimates with aggregation bias.

3.4.1 Overall Mean Imputation (OMI)

The mean imputation method is the process by which missing data are imputed by the mean of the available units of the same imputation class to which they belong (Cheng, 1999). One of the types of this method is the OMI method. The OMI method simply replaces each missing value by the overall mean of the available (responding) units in the same population. The overall mean is given by

$$\bar{y}_{omi} = \frac{\sum_{i=1}^{r} y_{ri}}{r} = \bar{y}_r$$

where $\bar{y}_{omi}$ is the mean of all the responding units of the variable $y$, $y_{ri}$ is the $i$th responding observation of $y$, and $r$ is the number of responding units.
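As a minimal sketch of the formula above (the function name and the sample values are hypothetical, and `None` is assumed to mark a missing observation), overall mean imputation can be written in a few lines:

```python
from statistics import mean, pvariance

def overall_mean_impute(values):
    """Replace every missing value (None) by the mean of the responding units."""
    observed = [v for v in values if v is not None]
    y_bar_r = mean(observed)          # overall mean of the r responding units
    return [y_bar_r if v is None else v for v in values]

# Hypothetical variable with two missing observations.
y = [10.0, None, 14.0, 18.0, None, 22.0]
y_imputed = overall_mean_impute(y)
observed = [v for v in y if v is not None]

print(y_imputed)
print(pvariance(observed), pvariance(y_imputed))
```

The mean of the completed data equals the respondent mean, while its variance is smaller than that of the observed values because every imputed value sits exactly at the mean.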

In performing this method, the imputation classes need not be homogeneous. The imputation class for this method is the entire population itself. In fact, in much of the related literature, imputation classes are not a requirement and are often ignored in performing this method.

There are many advantages and disadvantages of this method. The advantage of using this method is its universality. This means that it can be applied to any data set. Moreover, this method does not require the use of imputation classes to be homogeneous or the variables to be highly correlated. Without imputation classes, the method becomes easier to use and results are generated faster. Among the related literature included in this study, this is the most used method in imputing for missing data.


Figure 1 Distribution of the Data Before and After Imputation

However, there are serious disadvantages of this method. First, since missing values are imputed by a single value, the distribution of the data becomes distorted (see Figure 1). The distribution becomes too peaked, making it unsuitable for many subsequent analyses. Second, it produces large biases and variances because it does not allow variability in the imputation of missing values. Much of the related literature states that this is the least effective method and strongly discourages its use.

3.4.2 Hot Deck Imputation (HDI)

One of the most popular and widely known methods is the Hot Deck Imputation (HDI) method. The HDI method is the process by which the missing observations are imputed by choosing a value from the set of available units. This value is either selected at random (traditional hot deck), or in some deterministic way with or without replacement (deterministic hot deck), or based on a measure of distance (nearest-neighbor hot deck). To perform this method, let Y be the variable that contains missing data and X be a set of variables that has no missing data. In imputing for the missing data:

1. Find a set of categorical X variables that are highly associated with Y. The X variables selected will be the matching variables in this imputation.
2. Form a contingency table based on the X variables.
3. If there are cases with a missing Y within a particular cell of the table, select a case from the set of available units in that cell and impute its Y value to the missing value.

The donor chosen for a missing value must have similar or exactly the same characteristics as the recipient. Cheng (1999) stated that, by using imputation classes, the HDI procedure yields estimates that reflect the actual data more accurately. If the matching variables are closely associated with the variable being imputed, the nonresponse bias should be reduced, which is similar to the advantage of imputation classes stated earlier.

Example 1: Suppose that a study is conducted among ten people. Assume that three people in the survey refused to answer some of the questions in the study. The missing answer of each of these units is replaced by a known value from an observed unit that has similar characteristics, such as sex, degree or course (Course), Dean's Lister (DL), Honor student in High School (HS2), and Hours of study classes (HSC). Suppose the set of X matching variables consists of DL and HS2. Choosing the values to be imputed at random gives the result in Table 1, where the imputed values are shown in brackets.

Table 1: Using the Hot Deck Imputation to Impute the GPA

Person  Sex  DL  HS2  HSC  GPA
1       M    Y   Y    2    -[3.999]
2       F    Y   N    1    3.567
3       F    N   N    0    1.298
4       F    N   Y    0    2.781
5       M    N   Y    1    2.344
6       M    N   N    0    1.111
7       M    N   Y    1    -[2.781]
8       F    Y   N    1    3.246
9       F    Y   N    1    -[3.246]
10      F    Y   Y    1    3.999
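The random hot deck in Example 1 can be sketched as follows. This is an illustrative sketch only: the record layout, the function name `hot_deck` and the use of `None` for a missing GPA are assumptions, while the data are those of Table 1 with DL and HS2 as matching variables.

```python
import random

# Records from Table 1 as (person, DL, HS2, GPA); None marks a missing GPA.
records = [
    (1, "Y", "Y", None), (2, "Y", "N", 3.567), (3, "N", "N", 1.298),
    (4, "N", "Y", 2.781), (5, "N", "Y", 2.344), (6, "N", "N", 1.111),
    (7, "N", "Y", None), (8, "Y", "N", 3.246), (9, "Y", "N", None),
    (10, "Y", "Y", 3.999),
]

def hot_deck(records, seed=None):
    """Impute each missing GPA with a randomly chosen donor from the same class."""
    rng = random.Random(seed)
    donors = {}
    for _, dl, hs2, gpa in records:
        if gpa is not None:
            donors.setdefault((dl, hs2), []).append(gpa)   # donor pool per class
    imputed = {}
    for person, dl, hs2, gpa in records:
        if gpa is None:
            # Raises KeyError if a class has no donor, which is why the
            # number of imputation classes has to stay limited.
            imputed[person] = rng.choice(donors[(dl, hs2)])
    return imputed

imputed = hot_deck(records, seed=1)
print(imputed)
```

Person 1 can only receive 3.999 because Person 10 is the single donor in the (DL = Y, HS2 = Y) class, whereas Persons 7 and 9 each have two possible donors, so their imputed values depend on the random draw.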

The use of hot deck imputation in this example is justified. First, because the imputed values come from the same imputation class as the recipients, the nonresponse bias and the variance of the estimates decrease. This is because the observations within an imputation class are homogeneous. If the OMI method were used here, the bias and variance of the estimates would increase. More importantly, the distribution of the data is preserved. Under OMI, the distribution is certain to be distorted since only one value is substituted for all the missing values.

Like OMI, this method has certain advantages. One major attraction of this method cited by Kazemi (2005) is that the imputed values are all actual observed values. Another is the absence of out-of-range or impossible values. Out-of-range values are one of the problems of the regression imputation procedures, which will be tackled in the next section. More importantly, the shape of the distribution is preserved. Since imputation classes are introduced, the chance of distorting the distribution decreases.

On the other hand, it also has a set of disadvantages. First, in order to form imputation classes, all X variables must be categorical. Second, as the nonresponse rate increases, the possibility of generating a distorted data set also increases, because observations from the donor record might be used repeatedly for the missing values, causing the shape of the distribution to become distorted. Third, the number of imputation classes must be limited to ensure that every missing value has a donor in its class.


3.4.3 General Regression Imputation

Like the mean imputation and HDI methods, this procedure is one of the widely known imputation methods. The method of imputing missing values via least-squares regression is known as the regression imputation (RI) method. This technique is seen as a generalization of the group mean imputation (GMI), another name for the mean imputation discussed previously.

There are many ways of creating a regression model. In Kalton's study, the variable y for which imputations are needed is regressed on the matching variables (x1, x2, ..., xp) for the units providing a response on y. The imputation classes in this method are the categories of the matching variables that were transformed to dummy variables in the model. The matching variables may be quantitative or qualitative, the latter being incorporated into the regression model by means of dummy variables. The missing value may then be imputed in two basic ways: (a) use the predicted value from the model given the values of the matching variables for the record with a missing response, or (b) use this predicted value plus some type of randomly chosen residual. The former is called the Deterministic Regression Imputation (DRI) and the latter is called the Stochastic Regression Imputation (SRI) (Kalton, 1983).

In comparing the accuracy and efficiency of this method, it is helpful if the methods to be compared have the same imputation classes. In Kalton's study, two quantitative matching variables were considered, each with a few categories so that no categorization was needed. The general model based on imputation classes is of the form

$$\hat{y}_k = \hat{\beta}_0 + \sum_{i} \hat{\beta}_i x_{ik} + \hat{e}_k$$

where $\hat{\beta}_0$ and $\hat{\beta}_i$ are the parameter estimates computed from the $r$ responding units, $x_{ik}$ is the dummy independent variable for the $i$th matching variable of the $k$th nonresponding unit, $\hat{e}_k$ is the random residual and $\hat{y}_k$ is the value to be imputed for the $k$th nonresponding unit.

Stochastic Regression

The use of the predicted value from the model corresponds to the mean value imputation in the restricted model, and hence has the same undesirable distributional properties. A good case therefore exists for including the estimated residual. There are various ways in which this could be done, depending on the assumptions made about the residuals. The following are some of the more obvious possibilities:

1. Assume that the errors are homoscedastic and normally distributed, $N(0, \sigma_e^2)$. Then $\sigma_e^2$ could be estimated by the residual variance from the regression, $s_e^2$, and the residual for a recipient could be chosen at random from $N(0, s_e^2)$.
2. Assume that the errors are heteroscedastic and normally distributed, with $\sigma_{ej}^2$ being the residual variance in some group $j$. Estimate $\sigma_{ej}^2$ by $s_{ej}^2$, and choose a residual for a recipient in group $j$ from $N(0, s_{ej}^2)$.
3. Assume that the residuals all come from the same, unspecified, distribution. Then estimate $y_k$ by $\hat{y}_k + \hat{e}_i$, where $\hat{e}_i$ is the estimated residual for a randomly chosen donor.
4. The assumption in (3) accepts the linearity and additivity of the model. If there are doubts about these assumptions, it may be better to take not a randomly chosen donor but instead one close to the recipient in terms of his x-values (see Kalton, 1983). In the limit, if a donor with the same set of x-values is found, this procedure reduces to assigning that donor's y-value to the recipient.

RI has both advantages and disadvantages. It has the potential to produce imputed values that are closer to the true values of the nonresponse observations than those of the more assumption-free, rough-and-ready imputation class approaches. However, it is a time-consuming operation, and it is often unrealistic to consider applying it to all the items with missing values in a survey. For the method to be effective, that is, to impute a predicted value that is near the actual value, a high $R^2$ is needed (Kalton, 1983).

A few disadvantages specific to the deterministic and stochastic variants should be noted. In DRI, the distribution of the imputed data becomes too peaked and the variance is underestimated. In SRI, on the other hand, even when the deterministic imputed value is feasible, adding the residual to it could produce an unfeasible value.
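The contrast between DRI and SRI can be sketched with a one-predictor least-squares fit. This is a simplified illustration, not Kalton's general dummy-variable model: the function names and data are hypothetical, and the stochastic variant follows possibility (3) above by adding the estimated residual of a randomly chosen donor.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x, fitted on the responding units."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def regression_impute(x_obs, y_obs, x_missing, stochastic=False, seed=None):
    """Impute y for the recipients; stochastic=True adds a random donor residual."""
    b0, b1 = fit_line(x_obs, y_obs)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x_obs, y_obs)]
    rng = random.Random(seed)
    imputed = []
    for xk in x_missing:
        y_hat = b0 + b1 * xk                  # deterministic prediction (DRI)
        if stochastic:
            y_hat += rng.choice(residuals)    # residual of a random donor (SRI)
        imputed.append(y_hat)
    return imputed

# Hypothetical donors (x, y) and recipients (x only).
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1]
x_missing = [2.5, 4.5]
dri = regression_impute(x_obs, y_obs, x_missing)
sri = regression_impute(x_obs, y_obs, x_missing, stochastic=True, seed=7)
print(dri)
print(sri)
```

The DRI values all lie exactly on the fitted line, which is what makes the imputed distribution too peaked; the SRI values scatter around the line, at the cost of possibly landing on values that are not feasible for the variable.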

Chapter 4

Methodology
4.1 Source of Data

The purpose of this section is to give an overview of the data used in this study, which is the 1997 Family Income and Expenditure Survey (FIES).

4.1.1 General Background

The 1997 FIES is a nationwide survey with two visits per survey period on the same households, conducted by the National Statistics Office (NSO) every three years. The objectives of the survey are as follows:

1. to gather data on family income and family living expenditures and related information affecting income and expenditure levels and patterns in the Philippines;
2. to determine the sources of income and income distribution, levels of living and spending patterns, and the degree of inequality among families;
3. to provide benchmark information to update weights in the estimation of the consumer price index; and
4. to provide information in the estimation of the country's poverty threshold and incidence.

4.1.2 Sampling Design and Coverage

The sampling design of the 1997 FIES is a stratified multi-stage sampling design consisting of 3,416 Primary Sampling Units (PSUs) for the provincial estimates. The PSUs referred to by the 1997 FIES are the barangays. A subsample of 2,247 PSUs comprises the master sample for the regional level estimates (NSO, 1997-2005).

This multi-stage sampling design involved three stages. The first is the selection of sample barangays. The second is the selection of sample enumeration areas, which are subdivisions of barangays. The third is the selection of sample households. The sampling frame and stratification of the three stages were based on the 1995 Census of Population (POPCEN) and the 1990 Census of Population and Housing (CPH). Through this method, a sample of 41,000 households participated in the survey (NSO, 1997-2005).

4.1.3 Survey Characteristics

The 1997 FIES questionnaire contains about 800 data items, with the questions asked by the interviewer to the respondent of the selected sample household. A respondent is defined as the household head, the person who manages the finances of the family, or any member of the family who can give reliable information for the questionnaire (NSO, 1997-2005). The items or variables gathered in the 1997 FIES are listed in Appendix A.

4.1.4 Survey Nonresponse

Two types of nonresponse occurred in the 1997 FIES. The first type, which resulted from factors such as being unaware of the answer to the question, being unwilling to provide the answer, or omission of the question during the interview, is called item nonresponse. This type of nonresponse amounted to only 2.1% of the total number of respondents (NSO, 1997-2005).

The other type of nonresponse, which is due to households being temporarily away, on vacation, not at home, demolished or transferred to another residence during the second visit, is called partial nonresponse. This type of nonresponse amounted to only 3.6% of the total number of respondents (NSO, 1997-2005).

The NSO has only devised the deductive imputation for solving the problem of item nonresponse while no specific method was mentioned to compensate for the partial nonresponse (NSO, 1997-2005).

Hence, the researchers focus on the comparison of imputation procedures for partial nonresponse. The researchers chose one regional data set on which to apply the imputation techniques. The National Capital Region (NCR) was chosen because it was noted as the region with the highest nonresponse rate. The data consist of 4,130 households, with 39 categorical variables and the rest being continuous variables pertaining to the income and expenditures of the respondents. As for the variables to be imputed, the researchers chose two, namely the second visit Total Income (TOTIN2) and Total Expenditure (TOTEX2). These variables were chosen because of their importance to the FIES and the frequency of missing values in them.

4.2 The Simulation Method

In order to investigate and make an empirical comparison of the statistical properties of the estimates with imputed values using selected imputation methods, a data set with missing observations was simulated. This simulation method will create an artificial data set with missing observations to indicate which values will be imputed.

The algorithm for this simulation procedure is as follows:

1. To get the number of observations to be set to missing for each nonresponse rate, the total number of observations in the complete 1997 FIES data set, which is 4,130, was multiplied by the indicated nonresponse rate. The nonresponse rates used for this study were 10%, 20% and 30%. The rationale for setting different nonresponse rates is that the study aims to investigate the effect of varying nonresponse rates on each imputation method.
2. Each observation from the matrix of random numbers was assigned to both observations of the 1997 FIES second visit variables TOTIN2 and TOTEX2. This was done in order to satisfy the assumptions that the data have partial nonresponse and that the missing observations follow the Missing Completely At Random (MCAR) nonresponse pattern.
3. The second visit observations for both variables were sorted in ascending order of their corresponding random numbers.
4. The first 10% of the sorted second visit data for both variables were selected and set as missing observations. The same procedure was followed for the data sets containing 20% and 30% nonresponse rates, respectively.
5. The missing observations were flagged. This was done to distinguish the imputed values from the actual values during the data analysis.

This simulation method was implemented with the Decimal Basic program SIMULATION.BAS (Appendix B). The resulting matrices containing the missing observations for income and expenditure were stored in the files Simulated Values for Income (SIMI) and Simulated Values for Expenditure (SIME) for use in the application of the imputation methods.
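The steps above can be sketched in Python as follows. This is not the authors' SIMULATION.BAS program; the function name, the rounding of the missing count and the placeholder data are assumptions made for illustration.

```python
import random

def choose_missing(n, rate, seed=None):
    """Return the set of record indices to be set to missing (MCAR)."""
    rng = random.Random(seed)
    n_missing = round(n * rate)                    # step 1: e.g. 4130 * 0.10
    keys = [rng.random() for _ in range(n)]        # step 2: one random number per household
    order = sorted(range(n), key=keys.__getitem__) # step 3: sort by random number
    return set(order[:n_missing])                  # step 4: first 10%, 20% or 30%

n = 4130
# Placeholder second visit values; the real TOTIN2 and TOTEX2 are not reproduced here.
totin2 = [float(i) for i in range(n)]
totex2 = [float(2 * i) for i in range(n)]

missing = choose_missing(n, 0.10, seed=0)
# The same households are blanked in both variables (partial nonresponse).
sim_income = [None if i in missing else v for i, v in enumerate(totin2)]
sim_expenditure = [None if i in missing else v for i, v in enumerate(totex2)]
flags = [i in missing for i in range(n)]           # step 5: flag the missing observations
print(len(missing), sum(flags))
```

Because the records to blank are determined only by independent random numbers, the missingness is unrelated to income, expenditure or any other variable, which is what the MCAR assumption requires.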


4.3 Formation of Imputation Classes

Imputation classes are stratification classes that divide the data in order to produce groups that have similar characteristics. Assuming that the units that have the same characteristics have the propensity to give the same response, the formation of imputation classes would help reduce the bias of the estimates.

The steps undertaken in the formation of the imputation classes are as follows:

1. The researchers identified the potential matching variables, which are the candidate variables that could have an association with the variables of interest (i.e. TOTEX2 and TOTIN2).
2. A categorical variable from the first visit data had to meet three criteria in order to be selected as a candidate variable. First, the variable must be known. Second, the variable must be easy to measure. Lastly, the probability of missing observations for the variable must be small. A variable from the first visit data that met all three criteria could be used as a candidate variable.
3. For the variables that have many categories, the researchers reduced the number of categories. The rationale for this procedure is that having too many categories can increase heterogeneity and the bias of the estimates. This was done using the Recode function of the software Statistica.
4. Measures of association were computed for the candidate variables. The Chi-Squared Test for Independence was the first test applied. This was done to determine whether each candidate variable is significantly associated with the variables of interest.
5. Other measures of the association between the candidate variables and the variables of interest followed, namely the Phi-coefficient, Cramer's V and the Contingency Test. The candidate variable with the greatest degree of association was chosen as the matching variable that groups the data into their respective imputation classes.

All these tests were performed using the statistical packages Statistica and SPSS. The results of these tests are presented in the next chapter.
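The measures of association in steps 4 and 5 are all functions of the chi-squared statistic of a contingency table. The sketch below uses the standard textbook formulas on hypothetical tables; it is not the Statistica or SPSS output, it assumes that "Contingency Test" refers to Pearson's contingency coefficient, and it omits the p-value of the test.

```python
import math

def association_measures(table):
    """Chi-squared statistic, phi, Cramer's V and contingency coefficient.

    table is a list of rows of observed counts; every row and column total
    is assumed to be positive so that no expected count is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return {
        "chi2": chi2,
        "phi": math.sqrt(chi2 / n),
        "cramers_v": math.sqrt(chi2 / (n * (k - 1))),
        "contingency_c": math.sqrt(chi2 / (chi2 + n)),
    }

# Hypothetical 2 x 2 tables: no association and perfect association.
independent = association_measures([[10, 20], [20, 40]])
perfect = association_measures([[10, 0], [0, 10]])
print(independent)
print(perfect)
```

A candidate matching variable whose table with the (categorized) variable of interest gives values near zero carries little information for forming imputation classes, while values near the upper end indicate a strong association.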

4.4 Performing the Imputation Techniques

4.4.1 Overall Mean Imputation (OMI)

The Overall Mean Imputation (OMI) is an imputation procedure where the missing observations are replaced with the mean of the available units of the variable. As stated in the Conceptual Framework, this imputation method does not require the formation of imputation classes, which makes it the simplest procedure among the four methods in this study.

The procedures in applying the Overall Mean Imputation (OMI) are as follows:

1. The overall means of the first visit variables TOTIN1 and TOTEX1 were computed. The formula used for the computation of the overall mean is

$$\bar{y}_{omi} = \frac{\sum_{i=1}^{r} y_{ri}}{r}$$

where $\bar{y}_{omi}$ is the overall mean of the first visit variable TOTEX1 or TOTIN1, $y_{ri}$ is the $i$th first visit observation of the variable TOTEX1 or TOTIN1, and $r$ is the total number of responding units for the first visit variable TOTEX1 or TOTIN1.
2. Using the nonresponse data sets generated, the missing observations of the second visit variables TOTEX2 and TOTIN2 were replaced with the overall means of the first visit variables TOTEX1 and TOTIN1.

The Overall Mean Imputation (OMI) was implemented through the Decimal Basic program OMI.BAS (Appendix B).


4.4.2 Hot Deck Imputation (HDI)

The Hot Deck Imputation (HDI) is an imputation procedure where the missing observations are replaced by choosing a value from the set of available units.

The steps undertaken in applying the Hot Deck Imputation (HDI) are as follows:

1. The donor and recipient records for each imputation class and variable were first identified.
2. The missing observations of the second visit variables TOTIN2 and TOTEX2 were assigned to their respective recipient records for each imputation class, while the first visit TOTIN1 and TOTEX1 observations were placed in their respective donor records for each imputation class.
3. The values substituted for the missing observations were randomly chosen from the donor record of each imputation class.

The Hot Deck Imputation (HDI) was implemented through the Decimal Basic program HOT DECK.BAS (Appendix B).


4.4.3 Deterministic and Stochastic Regression Imputation (DRI and SRI)

Deterministic Regression Imputation (DRI) fits a least squares regression in which Y is regressed on the matching variables $(x_1, x_2, \ldots, x_p)$ in order to predict the missing value. Stochastic Regression Imputation (SRI) follows the same procedure as the deterministic regression but adds an error term $e$ to the estimated value before it replaces the missing data. The steps employed for the Regression Imputation are as follows:
1. A logarithmic transformation was applied to the first visit variables TOTEX1 and TOTIN1 as well as to the second visit variables TOTEX2 and TOTIN2. The rationale for this transformation is that the income and expenditure variables are not normally distributed. Moreover, logarithmic transformations help correct the non-linearity of the regression equation.
2. The regression equation was formed after the transformation. For this study, only one predictor variable was used, and the general form of the regression equation is
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{e}_i \]
where $\hat{y}$ is the predicted observation for the second visit variable TOTIN2 or TOTEX2, $\hat{\beta}_0$ and $\hat{\beta}_1$ are the parameter estimates, $x$ is the first visit variable, and $\hat{e}_i$ is the random residual term. Note that for DRI, $\hat{e}_i = 0$.
3. For the stochastic regression, which involves the computation of the error term, the following steps were made:
   (a) A frequency distribution of the residuals was created. This involved the following steps:
       i. The residuals were grouped into class intervals, and the frequency of each interval was obtained.
       ii. The relative frequencies and relative cumulative frequencies were computed.
   (b) The class means of the frequency distribution were used to obtain the error terms for the regression equation.
4. Model validation of the regression equations followed. This diagnostic checking requires the following assumptions to be satisfied:
   (a) Linearity
   (b) Normality of the error terms
   (c) Independence of the error terms
   (d) Constancy of variance
   The results of the diagnostic checking of each regression equation used in this study are presented in Appendix C.
5. The missing observations were replaced by the predicted values from the corresponding regression equation.
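The regression imputation steps above can be sketched as follows. This is a minimal illustration under several assumptions that the text does not confirm: the function names are hypothetical, the residuals are grouped into five equal-width classes, a class mean is selected with probability equal to its relative frequency, and predictions are transformed back from the log scale before being imputed.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x (single predictor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def residual_classes(residuals, k=5):
    """Group residuals into k equal-width classes.

    Returns (class mean, relative cumulative frequency) pairs for the
    non-empty classes.
    """
    lo, hi = min(residuals), max(residuals)
    width = (hi - lo) / k or 1.0  # guard against identical residuals
    groups = [[] for _ in range(k)]
    for r in residuals:
        groups[min(int((r - lo) / width), k - 1)].append(r)
    table, cum = [], 0
    for g in groups:
        if g:
            cum += len(g)
            table.append((sum(g) / len(g), cum / len(residuals)))
    return table

def draw_error(table, rng):
    """Pick a class mean with probability equal to its relative frequency."""
    u = rng.random()
    for class_mean, rcf in table:
        if u <= rcf:
            return class_mean
    return table[-1][0]

def regression_impute(first, second, rng=None):
    """Impute missing second visit values (None) from first visit values.

    rng=None gives the deterministic variant (error term 0); passing a
    random.Random instance gives the stochastic variant.
    """
    pairs = [(math.log(f), math.log(s))
             for f, s in zip(first, second) if s is not None]
    x, y = zip(*pairs)
    b0, b1 = fit_line(x, y)
    table = residual_classes([yi - (b0 + b1 * xi) for xi, yi in pairs])
    out = []
    for f, s in zip(first, second):
        if s is None:
            e = draw_error(table, rng) if rng else 0.0
            # Assumption: predictions are back-transformed from logs.
            s = math.exp(b0 + b1 * math.log(f) + e)
        out.append(s)
    return out

# Toy values, not FIES data.
first = [100.0, 200.0, 400.0, 800.0, 1600.0]
second = [110.0, 220.0, None, 880.0, 1760.0]
dri = regression_impute(first, second)
sri = regression_impute(first, second, random.Random(0))
```

In the thesis a separate model is fitted per imputation class, so a sketch like this would be applied once to the records of each class.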


4.5 Comparison of Imputation Techniques

4.5.1 The Bias and Variance of the Estimates

The primary objective of using imputation techniques is to generate statistically reliable estimates. To check whether the imputation techniques produce reliable estimates, and to determine the effect of the varying nonresponse rates on their performance, one of the three criteria, namely the bias and the variance of the sample mean, was measured.

To compute the bias of the mean of the imputed data, the following procedures were implemented:
1. The mean of the responding units, $\bar{y}_r$, was computed. For Hot Deck and Stochastic Regression Imputation, the average of the means of the 1,000 simulated data sets was computed.
2. The mean of the nonresponding units, $\bar{y}_m$, was computed.
3. The bias of the mean of the imputed data was computed as the difference between (1) and (2).

For the Overall Mean and Deterministic Regression Imputation, the variance is zero. On the other hand, for Hot Deck and Stochastic Regression Imputation, the variance is given by
\[ \mathrm{Var}(\bar{y}) = \frac{1}{n} s^2_{\bar{y}} \]
and
\[ s^2_{\bar{y}} = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 \]

The results of this section will be presented in the next chapter.
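As a hedged illustration of the variance formula above, the sketch below treats $y_i$ as the mean of the i-th simulated data set and $n$ as the number of simulated data sets. That reading is an assumption, since the text does not define these symbols, and the function name `variance_of_mean` is hypothetical.

```python
def variance_of_mean(sim_means):
    """Var(y-bar) = s^2 / n, where s^2 is the sample variance of n values."""
    n = len(sim_means)
    centre = sum(sim_means) / n
    s2 = sum((v - centre) ** 2 for v in sim_means) / (n - 1)
    return s2 / n

# Toy values: four simulated means.
spread = variance_of_mean([1.0, 2.0, 3.0, 4.0])

# A method that yields the same mean every time has zero variance.
constant = variance_of_mean([7.0, 7.0, 7.0])
```

The second call mirrors the statement that the variance is zero for the Overall Mean and Deterministic Regression Imputation, where the mean of the imputed data does not vary.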


4.5.2 Comparing the Distributions of the Imputed vs. the Actual Data

In order to determine which imputation method was able to maintain the distribution of the actual data, a goodness-of-fit test was utilized. For this study, the researchers chose the Kolmogorov-Smirnov (K-S) test. The Kolmogorov-Smirnov test is a goodness-of-fit test concerned with the degree of agreement between the distribution of a set of sampled (observed) values and some specified theoretical distribution (Siegel, 1988). In this study, the researchers were concerned with how the imputation methods affected the distribution of the FIES 1997 data.

The following steps were made for the Kolmogorov-Smirnov Test:
1. Income and Expenditure deciles were created. These deciles were based on the actual second visit FIES 1997 data.
2. The obtained deciles were used as upper bounds of the frequency classes.
3. A Frequency Distribution Table (FDT) was created for each trial. For this part, the researchers used the SPSS aggregate function to generate the FDT.
4. The FDT includes the Relative Cumulative Frequency (RCF) for both the imputed and the actual distribution. RCFs are computed by dividing the cumulative frequency by the total number of observations.
5. The absolute value of the difference between the actual RCF and the imputed RCF was computed using Microsoft Excel.
6. The test statistic for the Kolmogorov-Smirnov Test, the maximum deviation $D$, was determined using the formula
\[ D = \max |RCF_{\text{imputed}} - RCF_{\text{actual}}| \]
7. Since this is a large sample case and assuming a 0.05 level of significance, the critical value is computed using the formula
\[ \frac{1.36}{\sqrt{N}}, \quad N = 4{,}130 \]
8. If $D$ is less than the critical value, then it is concluded that the imputed data maintains the distribution of the actual data.

To provide additional information on the distribution of the imputed vs. the actual data, the frequency distributions of the actual (deleted) values and the imputed values were compared. This was done in order to show the effect of the imputed values on the distribution of the data set.

In performing this comparison, the following steps were made:
1. Income and Expenditure deciles were created. The deciles used in the previous test were also used here.
2. The obtained deciles were used as upper bounds of the frequency classes.
3. A Frequency Distribution Table (FDT) for both the imputed and the actual values was generated.
4. For Hot Deck and Stochastic Regression, which had 1,000 data sets each, the Relative Frequencies (RF) for each frequency class were averaged over the 1,000 RFs.

The results of this test are presented in the next chapter.
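The Kolmogorov-Smirnov computation described earlier in this section can be sketched as follows. The function names and toy values are hypothetical, and the sketch only mirrors the decile-based RCF comparison described in the text, not the SPSS and Excel workflow that was actually used. For $N = 4{,}130$ the critical value $1.36/\sqrt{N}$ works out to roughly 0.0212.

```python
import math
from bisect import bisect_right

def rcf(values, upper_bounds):
    """Relative cumulative frequency at each class upper bound."""
    ordered = sorted(values)
    # bisect_right counts the observations less than or equal to the bound.
    return [bisect_right(ordered, b) / len(ordered) for b in upper_bounds]

def ks_statistic(actual, imputed, upper_bounds):
    """D = max |RCF_imputed - RCF_actual| over the frequency classes."""
    a = rcf(actual, upper_bounds)
    b = rcf(imputed, upper_bounds)
    return max(abs(x - y) for x, y in zip(a, b))

def ks_critical(n, coefficient=1.36):
    """Large-sample critical value at the 0.05 level: 1.36 / sqrt(N)."""
    return coefficient / math.sqrt(n)

# Toy values, not FIES data; the bounds stand in for the deciles.
actual = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
imputed = [10, 20, 30, 40, 55, 55, 55, 80, 90, 100]
bounds = [20, 40, 60, 80, 100]

d = ks_statistic(actual, imputed, bounds)
critical = ks_critical(4130)
same_distribution = d < critical
```

Counting how often `same_distribution` is true across the simulated data sets gives a proportion of the kind used later as a ranking criterion.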


4.5.3 Other Measures in Assessing the Performance of the Imputation Methods

Lastly, the researchers adopted measures used by Kalton (1983) in his report entitled Compensating for Missing Data for evaluating the effectiveness of imputation methods. These measures are: (a) Mean Deviation (MD), (b) Mean Absolute Deviation (MAD) and (c) Root Mean Square Deviation (RMSD).

The Mean Deviation (MD) measures the bias of the imputed values. This is represented by the formula:
\[ MD = \frac{\sum_{i=1}^{m} (\hat{y}_{mi} - y_{mi})}{m}, \quad i = 1, 2, \ldots, m \]
where $\hat{y}_{mi}$ is the imputed value for the variables TOTEX2 or TOTIN2 and $y_{mi}$ is the actual value of the variables TOTEX2 or TOTIN2 for case $i = 1, 2, \ldots, m$.

According to Kalton (1983), the Mean Absolute Deviation (MAD) is a criterion for measuring the closeness with which the deleted values are reconstructed. This is represented by the formula:
\[ MAD = \frac{\sum_{i=1}^{m} |\hat{y}_{mi} - y_{mi}|}{m}, \quad i = 1, 2, \ldots, m \]
where $\hat{y}_{mi}$ is the imputed value for the variables TOTEX2 or TOTIN2 and $y_{mi}$ is the actual value of the variables TOTEX2 or TOTIN2 for case $i = 1, 2, \ldots, m$.

The Root Mean Square Deviation (RMSD) is the square root of the mean of the squared deviations between the imputed and actual observations. Like the MAD, it measures the closeness with which the deleted values are reconstructed. This is expressed as:
\[ RMSD = \sqrt{\frac{\sum_{i=1}^{m} (\hat{y}_{mi} - y_{mi})^2}{m}} \]
where $\hat{y}_{mi}$ is the imputed value for the variables TOTEX2 or TOTIN2 and $y_{mi}$ is the actual value of the variables TOTEX2 or TOTIN2 for case $i = 1, 2, \ldots, m$.

These three criteria for measuring the performance of the imputation techniques were implemented using the Decimal Basic program. After each imputation method is performed, the program computes the Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation and saves them in the corresponding Criteria for Expenditure (CRITEX) and Criteria for Income (CRITIN) files.
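The three measures can be computed together from the imputed and the deleted actual values. The following is a minimal sketch with a hypothetical function name and toy values; it is not the Decimal Basic program used in the study.

```python
import math

def deviation_measures(imputed, actual):
    """Return (MD, MAD, RMSD) for paired imputed and actual values."""
    m = len(actual)
    diffs = [y_hat - y for y_hat, y in zip(imputed, actual)]
    md = sum(diffs) / m                              # signed: shows direction of bias
    mad = sum(abs(d) for d in diffs) / m             # average size of the error
    rmsd = math.sqrt(sum(d * d for d in diffs) / m)  # penalizes large errors more
    return md, mad, rmsd

# Toy values, not FIES data.
md, mad, rmsd = deviation_measures([110.0, 90.0, 130.0], [100.0, 100.0, 100.0])
```

A positive MD means the imputed values overestimate the deleted values on average, and a negative MD means they underestimate them. Positive and negative errors cancel in the MD but not in the MAD or RMSD.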


4.5.4 Determining the Best Imputation Method

To answer the primary objective of this study, which is to determine the best or most appropriate imputation technique for FIES 1997, the researchers ranked the four imputation techniques based on the criteria discussed in the previous sections. The best method was selected separately for each variable of interest and nonresponse rate. The ranking of the imputation methods covered the following: Nonresponse Bias (NB); Estimated Percentage of Correct Distribution of the Imputed Data (PCD), which refers to the proportion of the simulated data sets in which the imputed data maintained the distribution of the actual data; Mean Deviation (MD); Mean Absolute Deviation (MAD); and Root Mean Square Deviation (RMSD).

The procedure for ranking is as follows:
1. For each criterion mentioned above, the imputation methods were ranked on a scale of 1 to 4, with 1 indicating the best imputation method and 4 the worst.
2. For each variable of interest (i.e. TOTEX2, TOTIN2), the rankings obtained by a particular imputation method across the criteria were added.
3. The imputation method with the lowest total was considered the best imputation method for the respective variable of interest and nonresponse rate.

The results of the ranking procedure are presented in the next chapter.
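The ranking procedure can be sketched as a rank-sum over criteria. In this sketch the function name, the rule that a smaller absolute value is better for every criterion except PCD, and the toy scores are all assumptions. Ties are not handled, because the text does not say how they were resolved.

```python
def rank_methods(scores, higher_is_better=("PCD",)):
    """Sum the per-criterion ranks of each method.

    scores: {criterion: {method: value}}. Rank 1 is the best method.
    """
    totals = {}
    for criterion, by_method in scores.items():
        if criterion in higher_is_better:
            ordered = sorted(by_method, key=lambda m: -by_method[m])
        else:
            # Assumption: smaller magnitude is better for NB, MD, MAD, RMSD.
            ordered = sorted(by_method, key=lambda m: abs(by_method[m]))
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    return totals

# Made-up scores for illustration only; these are not results of the thesis.
toy_scores = {
    "NB": {"OMI": -300, "HDI": -200, "DRI": -400, "SRI": 50},
    "PCD": {"OMI": 0, "HDI": 90, "DRI": 100, "SRI": 99},
}
totals = rank_methods(toy_scores)
best = min(totals, key=totals.get)
```

The procedure would be repeated for each variable of interest and nonresponse rate, since the best method is chosen separately for each combination.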

Chapter 5

Results and Discussion
5.1 Descriptive Statistics of Second Visit Data Variables
Table 2 shows the descriptive statistics of the second visit variables of interest (VIs), TOTEX2 and TOTIN2. These were computed to give a brief idea of how much a household spends and earns in a period of time, to measure the differences between the statistics of the two variables, and to compare the results with other tests later on.

Table 2: Descriptive Statistics of the 1997 FIES Second Visit

Variable   Mean        Std. Dev    Min        Max         N
TOTEX2     102,389.8   129,866.6   8,926.00   3,903,978   4,130
TOTIN2     134,119.4   216,934.9   9,067.00   4,357,180   4,130

The average total spending of a household in the National Capital Region (NCR) is about Php 102,389.80 while the average total earnings amounted to Php 134,119.40, a difference of more than thirty thousand pesos. It can be noted that the observations from TOTIN2 have a larger mean and standard deviation than those from TOTEX2. The dispersion can also be seen by looking at the minimum and maximum of the two variables.

5.2 Formation of Imputation Classes

Table 3 shows the results of the Chi-Square Test of Independence, which was performed to determine whether the candidate matching variables (MVs) are associated with the VIs. As stated in the methodology, the MV must be highly correlated with the variables of interest. The first visit VIs were tested for association rather than the second visit VIs, since the second visit VIs already contained missing data.

The candidate MVs that were tested are the provincial area codes (PROV), recoded education status (CODES1) and recoded total employed household members (CODEP1). The candidate PROV has four categories. The code 39 is designated for Manila, while 74 is designated for NCR District 2, which comprises Quezon City, Mandaluyong City, San Juan, Marikina and Pasig City. The code 75 is designated for NCR District 3, which covers Caloocan, Malabon, Navotas and Valenzuela. The last category for PROV is 76, NCR District 4, which includes Makati, Las Piñas, Muntinlupa, Parañaque, Pasay, Taguig and Pateros.

The candidate MV CODES1 has three categories. The original Education Status variable had 99 categories; hence, the researchers collapsed these into a smaller number of groups to reduce the heterogeneity and the bias of the estimates. The recoded MV CODES1 was coded as 1 for respondents whose educational attainment ranged from No Grade Completed to High School Graduate; 2 for respondents who were College Undergraduates or College Graduates; and 3 for respondents with an educational attainment higher than a Bachelor's Degree.

CODEP1 also has four categories. The original Total Employed Household Members variable had 7 categories and, like the Education Status variable, these were collapsed into a smaller number of groups. The recoded MV CODEP1 was coded as 0 for households with no employed members, 1 for households with one to two employed members, 2 for households with three to four employed members and 4 for households with five or more employed members.

Table 3: Results for the Chi-Square Test of Independence for the Matching Variables

The Chi-Square test of association for the candidates and the variables of interest showed that PROV, CODES1 and CODEP1 are associated with CODIN1 and CODEX1. The p-values for all the candidates were less than 0.0001, indicating that the association is highly significant. The results of the succeeding measures of association will determine which of the three candidates will be chosen as the MV of the study.

Table 4 shows the other measures of association, namely the Phi Coefficient, Cramér's V and the Contingency Coefficient. These measures were computed in order to assess the degree of association of the candidates with CODIN1 and CODEX1.

Table 4: Tests of Association for Matching Variable: Degree of Association

All the tests showed small degrees of association with the variables CODIN1 and CODEX1. This kind of result is expected in real complex data, given the larger variability among the observations. Table 4 clearly shows that CODES1 is the MV which exhibits the largest association among the candidates and is therefore the MV most likely to ensure that the ICs are homogeneous. Thus, CODES1 is the chosen MV for this data.
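For reference, the chi-square statistic and the three measures of association can be computed from a contingency table as sketched below. The sketch uses the standard textbook definitions, $\phi = \sqrt{\chi^2/n}$, $V = \sqrt{\chi^2/(n \cdot \min(r-1, c-1))}$ and $C = \sqrt{\chi^2/(\chi^2+n)}$. Whether the statistical software used in the study applies exactly these forms is an assumption, and the function names and toy table are hypothetical.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and total count for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n

def association_measures(table):
    """Phi, Cramer's V and the contingency coefficient C."""
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0])) - 1
    phi = math.sqrt(chi2 / n)
    cramers_v = math.sqrt(chi2 / (n * k))
    contingency = math.sqrt(chi2 / (chi2 + n))
    return phi, cramers_v, contingency

# Toy 2 x 2 table of counts, not FIES data.
toy = [[10, 20], [20, 10]]
phi, v, c = association_measures(toy)
```

For a 2 x 2 table, phi and Cramér's V coincide. For larger tables, such as a matching variable with three or four categories, V rescales the statistic so that it stays between 0 and 1.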

To give a detailed description of the CODES1 imputation classes, the descriptive statistics for each imputation class were obtained. Table 5 shows the descriptive statistics of each imputation class of the data. The descriptive statistics will tell whether the best MV decreases the variability of the observations. In checking the variability of each imputation class, the standard deviation will be used and compared with the overall standard deviation of the variables of interest.

Table 5: Descriptive Statistics of the Data Grouped into Imputation Classes.

The table above indicates that IC1 is the imputation class with the smallest standard deviation. The other two ICs, IC2 and IC3, produced large standard deviations; however, these are offset by the low value from IC1, which contains the largest proportion of the data. A possible reason why the standard deviation and the mean of IC3 are large is that the majority of the extreme values were contained in that class.


5.2.1 Mean of the Simulated Data by Nonresponse Rate for Each Variable of Interest

Results in Table 6 show the means for both second visit VIs, TOTEX2 and TOTIN2, under all NRRs. These were generated to be used as an input in the comparison with the mean of the imputed data for each IM.

Table 6: Means of the Retained and Deleted Observations

The means of the observations set to nonresponse and of the observations retained showed contrasting results. For both variables, TOTEX2 and TOTIN2, the mean of the observations set to nonresponse increases as the nonresponse rate increases. Conversely, the mean of the observations retained decreases as the nonresponse rate increases. Perhaps the large values that were set to nonresponse increased the means of the observations set to nonresponse across the varying nonresponse rates. Hence, as the number of missing values increases, the deviation between the means of the actual and retained data slowly increases.


5.2.2 Regression Model Adequacy

Table 7 shows the different regression models for all VIs and nonresponse rates (NRRs) that were checked for adequacy. The columns are as follows: (a) the VI, (b) the nonresponse rate (NRR), (c) the IC, (d) the prediction model, (e) the coefficient of determination ($R^2$) and (f) the F-statistic, with its corresponding p-value in parentheses.

For the notations used in Table 7, the codes IC1, IC2 and IC3 represent the first, second and third imputation classes, respectively. For the regression equations used for the regression imputation, $\hat{y}_i$ represents the dependent variable, which is the predicted second visit value for variable TOTIN2 or TOTEX2. Logarithmic transformations were utilized in order to correct the non-linearity of the regression equations. The code $LNFVE1_i$ is the logarithmic transformation of the first visit observation for the variable Total Expenditure (TOTEX1) under the First Imputation Class. Similarly, $LNFVI1_i$ is the logarithmic transformation of the first visit observation for the variable Total Income (TOTIN1) under the First Imputation Class. The same notation applies to $LNFVE2_i$ and $LNFVE3_i$ under the Second and Third Imputation Classes for the variable TOTEX1, and to $LNFVI2_i$ and $LNFVI3_i$ under the Second and Third Imputation Classes for the variable TOTIN1.

Table 7: Model Adequacy Results

Table 7 shows the regression models used for the regression imputations under their respective VIs and ICs. Before these equations were used for imputing missing values, diagnostic checking of the models was performed, covering Linearity, Normality of Error Terms, Independence of Error Terms and Constancy of Variance.

First, the researchers looked at the coefficient of determination, or $R^2$, of each regression equation in order to determine the explanatory power of the first visit VI for the second visit VI. A large value of $R^2$ is a good indication of how well the model fits the data. The highest $R^2$ in Table 7 is 93.2% (the equation for TOTEX2 under IC3 with the 30% nonresponse rate). Meanwhile, the lowest coefficient of determination belongs to the equation for TOTIN2 under IC1 with the 20% NRR, which had an $R^2$ of 70.3%. For all NRRs and VIs, the third IC generated the highest $R^2$ while the first IC produced the lowest $R^2$.

Second, the models were checked against the assumption of linearity. This was performed using the ANOVA tables presented in Appendix C. The results of the diagnostic checking showed that all models satisfied the assumption of linearity. The p-values for all the models were less than 0.0001, an indication that the linear relationship is highly significant.

Third, the regression models were checked against the assumption of normality. For this study, the researchers examined the Normal Probability Plot (NPP) of the regression models. The NPP in all models moderately follows an S-shaped pattern, which indicates that the residuals are not normal but rather lognormal. However, the shape of the NPP improved after the ln transformation was applied, even though the model was not linear previously. Since the data used are complex, the models were used even though the normality assumption for the residuals was not perfectly satisfied.

Fourth, in testing the assumption of independence of the error terms, the Durbin-Watson test was implemented. Results in Appendix C show that all of the models satisfy the assumption of independence.
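For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch with a hypothetical function name and toy residuals follows. Values near 2 are consistent with no first-order autocorrelation, while values near 0 or 4 suggest positive or negative autocorrelation.

```python
def durbin_watson(residuals):
    """d = sum_{t>=2} (e_t - e_{t-1})^2 / sum_t e_t^2."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Toy residuals: perfectly alternating signs, then perfectly persistent.
alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
persistent = durbin_watson([1.0, 1.0, 1.0, 1.0])
```

The statistic depends on the order of the residuals, so the ordering of the records matters when it is applied to cross-sectional household data.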

Lastly, to check if the residuals satisfy homoscedasticity or the equality of variances, a scatter plot of the residuals against the predicted values was obtained. Results showed that there were no distinct patterns evident in the scatter plot. The logarithmic transformation resolved the problem of heteroscedasticity.

Hence, given this discussion, the results show that the assumptions for the diagnostic checking of the regression equations used for the regression imputations are satisfied.


5.2.3 Evaluation of the Different Imputation Methods

In the evaluation of the different imputation methods (IMs), the results of each IM will be discussed independently. For each IM, the discussion of results will go as follows: (1) nonresponse bias and variances of the estimates of the population mean of the imputed data, (2) distribution of the imputed data using the Kolmogorov-Smirnov Goodness of Fit Test, and (3) other measures of variability using the mean deviation (MD), mean absolute deviation (MAD) and root mean square deviation (RMSD).

The table of results will contain the following columns: (a) VI, (b) NRR, (c) the bias of the population mean of the imputed data, Bias($\hat{y}$), (d) the variance of the population mean of the imputed data, Var($\hat{y}$), (e) Estimated percentage of correct distribution of the imputed data set to the actual data set (PCD), (f) Mean Deviation (MD), (g) Mean Absolute Deviation (MAD) and (h) Root Mean Square Deviation (RMSD).

Overall Mean Imputation

Table 8 shows the results of the different criteria in evaluating the newly created data with imputations using the overall mean imputation (OMI) method.

Table 8: Criteria Results for the OMI Method

1. Nonresponse Bias and Variance


Column (c) of Table 8 shows that, for both VIs, the value of the nonresponse bias decreases as the nonresponse rate increases. The decrease was faster and more dramatic for TOTIN2 than for TOTEX2. For TOTIN2, the decrease was almost 500% at the 20% NRR, and at the highest NRR it was almost triple the decrease observed at the 20% NRR. In contrast, the decrease in the bias for TOTEX2 was much slower. At the 20% and 30% NRRs, the biases for TOTIN2 are more than six times those for TOTEX2.

The variances for all NRRs and VIs are zero because the population mean of the imputed data set is constant. Unlike hot deck imputation (HDI) and stochastic regression imputation (SRI), the data were not simulated one thousand times. Because there was only a single simulation, the OMI method did not create a sampling distribution for the mean of the created data.

2. Distribution of the Imputed Data

Results in column (e) of Table 8 show that, for all nonresponse rates and variables, the OMI method failed to maintain the distribution of the actual data. This was expected, primarily because every missing observation in every data set was replaced by a single value, which is the overall mean of the first visit of the VI.

Results from other studies state that the OMI is one of the worst among all imputation methods. Although it is a simple process, it produces inaccurate results. Cases that differ significantly from the imputed values are the primary cause of the inaccuracy. Also, imputing only a single value for the missing data distorts the distribution of the data. The distribution becomes too peaked, which makes this method unsuitable for many post-imputation analyses (Cheng, 1999).

3. Other Measures of Variability

The three criteria in Table 8 under columns (f), (g) and (h) show the other measures of variability of the imputed data. In all the criteria, the values for TOTEX2 increase as the nonresponse rate increases. However, this is not the case for TOTIN2. Surprisingly, the data with twenty percent of the observations imputed have the highest values for the three criteria.

It is worth noting that the mean deviation, which focuses on each observation, contrasts with the results for the bias, which focuses on the population mean of the imputed data. The mean deviation for all nonresponse rates under the TOTEX2 variable overestimated the actual data, whereas in the results for the bias the population mean of the imputed data underestimates the actual data. Likewise for the other variable, when the mean deviation is an underestimate, the bias shows the opposite, an overestimate.

5.2.4 Hot Deck Imputation

Table 9 shows the results of the different criteria in evaluating imputed data with imputations using the hot deck imputation (HDI3) method with three imputation classes.

Table 9: Criteria Results for the HDI Method

1. Nonresponse Bias and Variance

Similar to the results of the OMI method, the bias of the population mean of the imputed data increases for both variables as the NRR increases. As seen in OMI, for the TOTIN2 variable, the bias of the data with twenty percent imputations is more than four times the bias of the data with ten percent imputations and almost half the bias of the data with thirty percent imputations. The bias in the TOTIN2 variable under this method is a little worse than under the OMI method.

Results similar to those of OMI were seen for the other variable, TOTEX2: in the data which contained 30% imputations, the bias becomes negative. The bias decreases in value as the NRR increases. For the first and second NRRs, the biases under HDI3 were better than those under OMI.

The variance of the population mean of the data which have imputations increases by more than one hundred percent as the nonresponse rate increases. The data which contained the lowest number of imputations provided the least spread of the population means and the data which contained the largest number of imputation provided the worst spread.

2. Distribution of the Imputed Data

Results in column (e) show that for TOTIN2, the imputed data maintained the distribution of the actual data for the data with ten and twenty percent imputations. For the other variable, only the data with ten percent imputations maintained the distribution of the actual data in all one thousand data sets. In the data with twenty percent imputations, only 969 out of the 1,000 data sets maintained the distribution of the actual data.

In the data sets which contained the largest number of imputations, both variables failed to maintain the distribution of the actual data. None of the simulated data sets for TOTEX2 registered the same distribution as the actual data, while for TOTIN2 only a lone data set maintained the same distribution as the actual data. The researchers considered the possibility that more than one recipient shares the same donor, or that the majority of the imputations come from one particular area in the record.

3. Other Measures of Variability

For the three remaining criteria, the values generated were better than the results of the OMI method. For both variables, the MD criterion showed an underestimation of the actual observations. While the OMI method overestimates the deleted actual values for the TOTIN2 variable, the HDI3 underestimates them. The underestimation rapidly increases as the nonresponse rate increases. The magnitude of the MD for TOTIN2 is larger for HDI3 than for OMI at all nonresponse rates. Similar to the results for the MD of TOTIN2, the MAD and RMSD were unusually large compared to OMI. It seems that the imputation classes were not as effective for the TOTIN2 variable as for the TOTEX2 variable, for which the majority of values across all nonresponse rates and criteria showed that HDI3 was better than OMI.


5.2.5 Deterministic Regression Imputation

Table 10 shows the results of the different criteria in evaluating the imputed data using the deterministic regression imputation method with three imputation classes (DRI).

Table 10: Criteria Results for the DRI Method

1. Nonresponse Bias and Variance

In Table 10, the bias for all NRRs and VIs is negative, which indicates that the population mean of the imputed data is underestimated. The results for the nonresponse bias under this method are similar to those of the previous two methods in that TOTIN2 is underestimated. However, unlike OMI and HDI, in which the bias increases tremendously as the nonresponse rate increases, the increase in bias under this method is much slower. The bias of the data with twenty percent imputations is just twice the bias of the data set with a lower percentage of imputations. For the TOTEX2 variable, this method produces more biased estimates for all NRRs than the two previous methods.

As in the OMI method, the variance for this method is also zero since the population mean is constant due to a single simulation of the missing observations.

2. Distribution of the Imputed Data

Contrary to the results of the OMI method under this criterion, the DRI maintained the distribution for all the NRRs and VIs. It even performed much better than the HDI, since all of the imputed data sets under all the NRRs and VIs preserved the same distribution as the actual data. It is interesting to note that the regression models used in this study did not show the results expected from the related literature. Earlier studies that made use of categorical auxiliary variables, the kind of variables known as matching variables in this study, conclude that deterministic regression, just like mean imputation, generates distorted and peaked distributions. However, in this study, the independent variable was the first visit VI, and each imputation class had its own fitted model with a better $R^2$, which made the difference.

3. Other Measures of Variability

Similar to the results for the nonresponse bias, the MD for all VIs and NRRs underestimates the actual observations. The underestimation is almost stable across NRRs because the rate of change is very small compared to the two previous IMs. The MAD and RMSD show better results than OMI and HDI, with imputed values closer to the actual observations. As seen in OMI and HDI, TOTIN2 has larger values for the MAD and RMSD criteria. Fitting models with high $R^2$ was the key factor that made this method better than the other two IMs previously evaluated.

5.2.6 Stochastic Regression Imputation

Table 11 shows the results of the different criteria in evaluating the imputed data using the stochastic regression imputation method with three imputation classes (SRI).

Table 11: Criteria Results for the SRI Method

1. Nonresponse Bias and Variance

The only method that produced reasonable estimates is the SRI method. The random residual added to the deterministic predicted observation made the difference. There is no clear relationship between the nonresponse bias of the population mean estimates and the nonresponse rate; the biases fluctuate from one nonresponse rate to the other. This method provided the least bias at the highest nonresponse rate for both TOTEX2 and TOTIN2. While the other methods reached a four-digit bias, the SRI generated a much smaller bias than the other three methods. Moreover, there is a huge disparity at the third nonresponse rate, where it produced less than twenty percent of the bias produced by its deterministic counterpart.

The variances of the SRI proved to be much better than those of its model-free counterpart, the HDI. For all variables and nonresponse rates, there is a huge disparity between the variances of the SRI and the HDI. Variances from the HDI are almost ten times larger than those from the SRI.

2. Distribution of the Imputed Data

The SRI performed better than its model-free counterpart, the HDI method, which also simulated the data 1,000 times. Unlike hot deck imputation, stochastic regression imputation maintained the same distribution for all imputed data sets at the first and third nonresponse rates. It also outperformed the former at the second nonresponse rate for the TOTEX2 variable. One of the reasons why 16 out of the 1,000 imputed data sets with twenty percent, or 826, imputations failed to maintain the distribution of the actual data set might be the infeasibility of the predicted values.

In earlier studies, stochastic regression imputation performed better than any of the other methods used here. The random residual is added to the deterministic predicted value to preserve the distribution of the data. However, even if the original deterministic imputed values are feasible, their stochastic counterparts need not be. After the residual is added to the deterministic imputation, infeasible values can result (Nordholt, 1998).

3. Other Measures of Variability

Similar to the results for the nonresponse bias, the MD has no clear relationship with the NRR, since it fluctuates from one NRR to another. On this criterion, the SRI outperformed its regression counterpart but was outperformed by the two other methods. In contrast to the MD results, the SRI ranks a close second to the DRI method on the other criteria and provides better values than the two other methods.

In the review of related literature, stochastic regression performs far better than deterministic regression. The researchers point to the same reason as in the previous criterion: it is likely that the stochastic predicted values are unrealistic compared to the deterministic predicted values.

After comparing the different methods using the criteria proposed in the methodology, the distributions of the true values (TVs) that were deleted and of the imputed values (IVs) from each imputation procedure were computed for all VIs and nonresponse rates. Tables 12, 13 and 14 show the frequency distributions of the methods with their corresponding relative frequencies (RFs) for the first, second and third nonresponse rates, respectively. The RFs for the 1000 simulated data sets from HDI and SRI were averaged. The first column gives the frequency classes of the VIs. These are the same classes that were used in the Kolmogorov-Smirnov goodness-of-fit test to determine the estimated percentage of imputed data sets whose distributions are similar to that of the actual data. The second column gives the relative frequencies of the actual data. The succeeding columns give the relative frequencies for each imputation method.
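To make the comparison over frequency classes concrete, the sketch below computes a Kolmogorov-Smirnov statistic from two sets of relative frequencies as the largest gap between their cumulative relative frequencies. It is an illustration only: the relative frequencies are hypothetical, the function names are our own, and the large-sample critical value 1.36/sqrt(n) at the 5% level is assumed rather than taken from the thesis.

```python
from itertools import accumulate
from math import sqrt

def ks_statistic(rf_actual, rf_imputed):
    """Largest absolute gap between the two cumulative relative frequencies."""
    cum_actual = accumulate(rf_actual)
    cum_imputed = accumulate(rf_imputed)
    return max(abs(a - i) for a, i in zip(cum_actual, cum_imputed))

def same_distribution(rf_actual, rf_imputed, n, coef=1.36):
    """Assumed large-sample rule: keep the null hypothesis if D <= coef / sqrt(n)."""
    return ks_statistic(rf_actual, rf_imputed) <= coef / sqrt(n)

# Hypothetical relative frequencies over five frequency classes.
rf_actual = [0.10, 0.20, 0.30, 0.25, 0.15]
rf_imputed = [0.15, 0.20, 0.27, 0.24, 0.14]

d = ks_statistic(rf_actual, rf_imputed)
keeps_at_413 = same_distribution(rf_actual, rf_imputed, n=413)
keeps_at_826 = same_distribution(rf_actual, rf_imputed, n=826)
```

The same gap of 0.05 is tolerated with 413 imputed cases but rejected with 826, which shows why a method can pass the test at one nonresponse rate and fail at another without any change in the shape of its imputed distribution.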

Table 12: Distribution of the True Values and Imputed Values from the imputation procedures: 10% NRR


Table 13: Distribution of the True Values and Imputed Values from the imputation procedures: 20% NRR


Table 14: Distribution of the True Values and Imputed Values from the imputation procedures: 30% NRR


For the actual and imputed data with the lowest number of observations set to missing, the corresponding table clearly illustrates the distortion of the distribution created by the OMI method. The OMI method assigns the mean of the first visit VI to all the missing cases; as a result, every missing value is replaced by the same single value and the imputed distribution concentrates in one frequency class. The three methods that implemented imputation classes gave a better outcome than OMI by spreading out the distribution of the imputed data.
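The concentration effect of overall mean imputation can be seen in a small sketch. The values, class boundaries and the helper `class_index` are hypothetical and chosen only for illustration.

```python
import statistics
from collections import Counter

def class_index(value, upper_bounds):
    """Return the index of the first frequency class whose upper bound exceeds value."""
    for k, bound in enumerate(upper_bounds):
        if value < bound:
            return k
    return len(upper_bounds)  # open-ended last class

first_visit = [20.0, 35.0, 50.0, 80.0, 115.0]  # hypothetical first-visit values
n_missing = 4

# Overall mean imputation: every missing case receives the same value.
omi_values = [statistics.fmean(first_visit)] * n_missing

bounds = [40.0, 70.0, 100.0]  # hypothetical class boundaries
counts = Counter(class_index(v, bounds) for v in omi_values)
```

All four imputed cases land in the class that contains the mean (here the second class), regardless of how the deleted true values were actually spread.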

For the HDI method, at all nonresponse rates, most of the imputed observations clustered in the first frequency class, that is, below 37859.5 for TOTEX2 and below 40570 for TOTIN2. Clusters also formed in the last frequency class for TOTEX2 at the first and third nonresponse rates, and in the second frequency class for TOTIN2 at all nonresponse rates. The percentage of imputed data in the lowest class for TOTEX2 and TOTIN2 ranges from 14-16% across all nonresponse rates, compared with only 9-11% in the actual data.

While the lowest class is overrepresented, underrepresentation was observed in the interval 86103-126254.5 for the 10% and 20% nonresponse imputed data sets and in the interval 63265-101947 for the 30% nonresponse imputed data sets. For the 10% and 20% rates, the indicated interval holds about 30% of the actual data, whereas the imputed data total less than 30%.

Unlike hot deck and OMI, which each had a major cluster, the two regression imputation methods produced more spread-out distributions, although some areas are underrepresented. The failure to include a random residual term in deterministic regression resulted in a severe underrepresentation of the data, in particular in the first frequency class. The SRI, which does include a random residual, provided better results than DRI. However, in some areas the added random residual produced a significant excess, mostly in the last frequency class.


5.3 Choosing the Best Imputation

In this section, the rankings across all the tests are the basis for determining which of the IMs is the best for this particular study and data. The best method is selected separately for each VI and NRR. The rankings follow a four-point system in which a rank of 1 denotes the best IM for a given criterion and a rank of 4 denotes the worst. In case of ties, the average of the tied ranks is assigned. The IM with the smallest rank total is declared the best IM for the particular VI and NRR. The rankings cover the following criteria: (a) nonresponse bias, (b) preservation of the distribution of the data, and (c) other measures of variability. In all, each IM is ranked on five criteria.
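The ranking rule can be written out as a short sketch. The scores below are hypothetical, the function names are our own, and the sketch assumes that a smaller score is better under every criterion (a criterion where larger is better, such as a percentage of preserved distributions, would need its scores negated first).

```python
def average_ranks(scores):
    """Rank methods from 1 (smallest score) upward; tied methods share the average rank."""
    ordered = sorted(scores, key=scores.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to the end of the block of methods tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg = ((i + 1) + (j + 1)) / 2  # average of the rank positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    return ranks

def best_method(criteria):
    """Sum the ranks over all criteria and return the method with the smallest total."""
    totals = {}
    for scores in criteria.values():
        for method, rank in average_ranks(scores).items():
            totals[method] = totals.get(method, 0.0) + rank
    return min(totals, key=totals.get), totals

# Hypothetical scores for one VI and one NRR (smaller is better).
criteria = {
    "bias": {"OMI": 4.0, "HDI": 5.0, "DRI": 2.0, "SRI": 2.0},
    "MAD": {"OMI": 9.0, "HDI": 8.0, "DRI": 3.0, "SRI": 4.0},
    "RMSD": {"OMI": 5.0, "HDI": 9.0, "DRI": 6.0, "SRI": 4.0},
}
best, totals = best_method(criteria)
```

In the bias criterion DRI and SRI are tied for first and second place, so each receives the average rank 1.5, as the tie rule prescribes.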

Tables 15, 16 and 17 show the rankings of the different imputation methods for the 10%, 20% and 30% NRR, respectively. Each table is divided into six columns: the first gives the VI, the second the criterion, and the third through sixth the imputation methods.

Table 15: Ranking of the Different Imputation Methods: 10% NRR

Table 16: Ranking of the Different Imputation Methods: 20% NRR

Table 17: Ranking of the Different Imputation Methods: 30% NRR

The rankings show that the two regression imputation methods provided better results than their model-free counterparts. For all nonresponse rates under the TOTIN2 variable, the two regression methods tied as the best imputation method and, surprisingly, the HDI finished as the worst imputation method, behind OMI. Under the TOTEX2 variable, the rankings were mixed across nonresponse rates, although the regression methods still provided good results. The SRI method finished first at the 10% and 30% NRR and third at the 20% NRR, while the DRI method finished third, first and second at the 10%, 20% and 30% NRR, respectively. Whereas the HDI was the worst IM for TOTIN2, the OMI was the worst IM for TOTEX2, ranking last at both the 10% and 20% NRR and third at the 30% NRR.

In conclusion, the best imputation method for this study, using the 1997 FIES data, is the Stochastic Regression Imputation, followed very closely by the Deterministic Regression Imputation. The SRI method did not rank last under any criterion, NRR or VI, unlike the DRI, which was the worst IM in the nonresponse bias and Mean Deviation criteria. The researchers selected the HDI as the worst IM in this study, since it rated poorly in the majority of the criteria under each NRR and VI.

Chapter 6

Conclusion
This paper discussed a range of imputation methods to compensate for partial nonresponse in survey data and presented empirical evidence on the advantages and disadvantages of the methods. It showed that when applying imputation procedures, it is important to consider the type of analysis and the type of point estimator of interest. Whether the goal is to produce unbiased and efficient estimates of means, totals, proportions and official aggregated statistics, or a complete data file that can be used for a variety of analyses by different users, the researcher should first clearly identify the type of analysis that suits his or her purpose. In addition, several practical issues concerning the ease of implementation, such as the difficulty of programming, the amount of time required and the complexity of the procedures used, must also be taken into consideration.

Anyone faced with having to make decisions about imputation procedures will usually have to choose some compromise between what is technically effective and what is operationally expedient. If resources are limited, this is a hard choice. This study aims to help future researchers in choosing the most appropriate imputation technique for the case of partial nonresponse.

For our particular implementation, all of the methods were written in a programming language because no available software could generate imputations for all the methods needed for this study. Of all the methods, the overall mean imputation was the easiest to use and to program. The other three methods required the formation of imputation classes. The two regression imputations were the hardest to program and the most time-consuming.

The performance of several imputation methods in imputing partial nonresponse observations was compared using the 1997 Family Income and Expenditure Survey (FIES) data set. A set of criteria was computed for each method, based on the data set with imputed values and the data set with actual values, to find the best imputation method. The criteria for judging the best method were the bias and variance estimates of the imputed data, the preservation of the distribution of the actual data, and the other measures of accuracy and precision adopted from the study of Kalton (1983).

The results show that the choice of imputation method significantly affected the estimates of the actual data. The similarities between the two best methods, namely the Deterministic and Stochastic Regression imputation methods, were due in part to the adequacy and predictive power of the models.

The bias and variance estimates of the imputed data varied considerably across imputation methods, and it was unexpected that the Hot Deck Imputation method rendered the highest estimates for the majority of the nonresponse rates and variables. Stochastic Regression, on the other hand, was the best method under this criterion, since it produced relatively small biases and variances in the majority of the tests.

The distributions of the imputed data from each method were checked for preservation of the actual distribution using the Kolmogorov-Smirnov goodness-of-fit test. Among the methods used in this study, both regression imputation methods retained the distribution of the data, especially the Deterministic Regression Imputation, which generated exactly the same distribution as the actual data.

In the other tests of accuracy and precision, namely the mean deviation, mean absolute deviation and root mean square deviation, the methods gave mixed results across nonresponse rates. Some methods did not yield consistently and clearly good results. Only half of the methods performed well in one particular criterion, the preservation of the distribution of the data. In the other results, inconsistency was evident in the alternating rankings of the methods.
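The three measures named above can be sketched as follows. The sketch assumes the usual definitions based on the differences between imputed and true values (the exact formulas used in this study are stated in the methodology, not in this chapter), and the data are hypothetical.

```python
from math import sqrt

def mean_deviation(imputed, true):
    """Average signed difference; positive and negative errors can cancel."""
    return sum(i - t for i, t in zip(imputed, true)) / len(true)

def mean_absolute_deviation(imputed, true):
    """Average size of the error, ignoring its sign."""
    return sum(abs(i - t) for i, t in zip(imputed, true)) / len(true)

def root_mean_square_deviation(imputed, true):
    """Square root of the average squared error; penalizes large errors more."""
    return sqrt(sum((i - t) ** 2 for i, t in zip(imputed, true)) / len(true))

true_values = [10.0, 20.0, 30.0, 40.0]     # hypothetical deleted true values
imputed_values = [12.0, 18.0, 33.0, 37.0]  # hypothetical imputed values

md = mean_deviation(imputed_values, true_values)
mad = mean_absolute_deviation(imputed_values, true_values)
rmsd = root_mean_square_deviation(imputed_values, true_values)
```

Here the signed errors cancel, so the mean deviation is zero even though no single value was imputed correctly. This is one way a method can rank well on the mean deviation and poorly on the other two measures.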

Given the criteria and procedures for judging the best imputation procedure among the four methods, selecting the best method was difficult. Consequently, to determine the best method of imputing nonresponse observations for each variable in the study, the methods were ranked according to several criteria. A rank of 1 indicates the best imputation method for a particular criterion, while a rank of 4 indicates the worst.

After comparing the methods, the two regression methods, namely the Deterministic and Stochastic Regression Imputation, gave outstanding results. The researchers concluded that the Stochastic Regression Imputation procedure is the best imputation method for this study, since it did not rank poorly in any criterion under all NRRs and VIs.

The efficiency of the imputation method was supported by the R² of the model and by the random residual added to the deterministic imputed value. The random residuals added to the deterministic imputation made the estimates less biased than those of its deterministic counterpart.

The deterministic regression imputation method performed much better than the Hot Deck imputation method. It is surprising that Hot Deck imputation was less efficient than deterministic regression, since in the related studies it emerged as the better of the two. Most likely the selection of donors with replacement, and not the imputation classes, caused its poor performance. If the imputation classes were the cause of its low ranking, then the estimates from both regression imputation methods, which also relied on imputation classes, would be as poor as those of the Hot Deck imputation even if the model is adequate.
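Donor selection with replacement within imputation classes can be sketched as below. This is a simplified illustration and not the program used in this study; the record layout, the class labels and the function name `hot_deck` are hypothetical.

```python
import random
from collections import defaultdict

def hot_deck(records, rng):
    """Fill each missing 'y' with a randomly chosen respondent value from the same class.

    Donors are drawn with replacement, so one respondent may donate several times.
    A class without any respondent would raise IndexError in rng.choice.
    """
    donors = defaultdict(list)
    for rec in records:
        if rec["y"] is not None:
            donors[rec["cls"]].append(rec["y"])
    completed = []
    for rec in records:
        if rec["y"] is None:
            completed.append(rng.choice(donors[rec["cls"]]))
        else:
            completed.append(rec["y"])
    return completed

records = [
    {"cls": "A", "y": 100.0},
    {"cls": "A", "y": 120.0},
    {"cls": "A", "y": None},
    {"cls": "B", "y": 300.0},
    {"cls": "B", "y": None},
    {"cls": "B", "y": None},
]
completed = hot_deck(records, random.Random(0))
```

In class B the single respondent is used as the donor for both missing cases. Repeated use of the same donor inflates the frequency of that donor's value, which is one way selection with replacement can distort the imputed distribution.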

Chapter 7

Recommendations for Further Research
In this study, we compared four imputation methods commonly used in dealing with partial nonresponse, under the assumption of MCAR. However, there are other methods that are currently being developed and improved. For example, the multiple imputation method involves independently imputing more than one value for each missing value. Multiple imputation is an important and powerful form of imputation and has the advantage that variance estimation under imputation can be carried out comparatively easily (Kalton, 1983).
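As an indication of why variance estimation is comparatively easy under multiple imputation, the sketch below applies the standard combining rules usually attributed to Rubin: the point estimate is the average over the m completed data sets, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. These rules are not described in this thesis, and the numbers are hypothetical.

```python
import statistics

def combine(estimates, variances):
    """Combine m completed-data estimates and their variances into one estimate."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # combined point estimate
    u_bar = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical means and their estimated variances from m = 5 imputed data sets.
estimates = [100.0, 102.0, 98.0, 101.0, 99.0]
variances = [4.0, 4.0, 4.0, 4.0, 4.0]

q_bar, total_variance = combine(estimates, variances)
```

The between-imputation term is what a single imputation cannot supply: it reflects how much the estimate changes from one set of imputed values to another.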

Regarding variance estimation, further studies should use proper variance estimators such as the Jackknife variance estimator, which is often used in comparing the variance estimates of imputation methods. Rao and Shao (1992) proposed an adjusted Jackknife variance estimator for use with imputation methods related to the Hot Deck imputation procedure. This variance estimator is said to be asymptotically unbiased.
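For orientation, the ordinary delete-one jackknife is sketched below for complete data. It is not the adjusted estimator of Rao and Shao, which additionally corrects the imputed values each time a respondent is deleted; the function name and the data are hypothetical.

```python
import statistics

def jackknife_variance(data, estimator=statistics.fmean):
    """Delete-one jackknife variance of an estimator (unadjusted, complete data)."""
    n = len(data)
    # Recompute the estimator n times, leaving out one observation each time.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = statistics.fmean(leave_one_out)
    return (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)

data = [3.0, 5.0, 7.0, 9.0, 11.0]  # hypothetical observations
v_jack = jackknife_variance(data)

# For the sample mean, the jackknife reproduces the usual estimate s^2 / n.
v_usual = statistics.variance(data) / len(data)
```

Applying this unadjusted form to a data set in which imputed values are treated as if they were observed would generally understate the variance, which is the problem the adjusted estimator is meant to address.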

Future researchers may test other methods on the same data set and compare the results with those presented in this paper. They could also compare the results of this study with those obtained using multiple imputation and the Rao-Shao jackknife variance estimator. These procedures, however, require more advanced knowledge of statistics, including Bayesian statistics. The complexity of the methods, especially of both regression imputations, could hinder future researchers in the use of modern variance estimators.

It is also suggested that the matching variable be selected using advanced modern statistical methods such as CHAID analysis. The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods, originally proposed by Kass (1980); according to Ripley (1996), the CHAID algorithm is a descendant of THAID, developed by Morgan and Messenger (1973). CHAID will "build" non-binary trees (i.e., trees where more than two branches can attach to a single root or node), based on a relatively simple algorithm that is particularly well suited for the analysis of larger datasets. Also, because the CHAID algorithm will often effectively yield many multi-way frequency tables (e.g., when classifying a categorical response variable with many categories, based on categorical predictors with many classes), it has been particularly popular in marketing research, in the context of market segmentation studies (Statsoft, 2003).

In pursuing regression imputation, instead of creating a separate model for each imputation class, which can be time-consuming and frustrating since not all models will give the same result, dummy variables should be inserted in the model. These dummy variables represent the categories of the matching variables. This would save time and money, since only one model is created and tested.
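A single model with dummy variables can be sketched as follows. The category names, the data and the helper functions are hypothetical, and the small solver is included only to keep the example self-contained; in practice a statistical package would fit the model.

```python
def one_hot(category, levels):
    """Dummy-code a category, dropping the first level as the reference class."""
    return [1.0 if category == level else 0.0 for level in levels[1:]]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(rows, y):
    """Least squares through the normal equations (X'X) beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

levels = ["urban", "rural"]  # hypothetical categories of one matching variable
data = [  # (category, first-visit value, second-visit value), all hypothetical
    ("urban", 10.0, 25.0),
    ("urban", 20.0, 45.0),
    ("rural", 10.0, 15.0),
    ("rural", 30.0, 55.0),
]
# Design row: intercept, predictor, then one dummy per non-reference category.
rows = [[1.0, x] + one_hot(c, levels) for c, x, _ in data]
y = [v for _, _, v in data]
beta = fit(rows, y)  # [intercept, slope, shift for "rural"]
```

One fitted equation then serves every imputation class: the dummy coefficient shifts the intercept for the "rural" class, so there is only one model to validate. The sketch assumes a common slope across classes; allowing class-specific slopes would require interaction terms.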

The researchers strongly recommend using a statistical package that can generate imputations faster and more easily, and with less biased estimates, than custom programming. It would save time compared with writing a computer program, which consumes a majority of the research time in debugging, and would prevent computer crashes due to memory overload.

Bibliography
[1] Cheng, J.H. and Sy, F. A Comparison of Several Techniques of Imputation on Clinical Data (Undergraduate Thesis, De La Salle University) (1997).

[2] Kalton, G. (1983). Compensating for Missing Survey Data. Michigan.

[3] Musil, C., Warner, C., Yobas, P. K. and Jones, S. A Comparison of Imputation Techniques for Handling Missing Data. Western Journal of Nursing Research, Vol. 24, No. 7, 815-829 (2002).

[4] National Statistics Office (NSO). (1997-2005). Technical Notes on the 1997 Family Income and Expenditure Survey (FIES). Retrieved 18 June 2007, from http://www.census.gov.ph/data/technotes/notefies.html

[5] Neter, J., Wasserman, W. and Kutner, M.H. Applied Linear Statistical Models, 2nd ed. Homewood, Illinois: Richard D. Irwin, Inc.

[6] Nordholt, E.S. (1998). Imputation: Methods, Simulation, Experiments and Practical Examples. International Statistical Review, Vol. 66, No. 2, 157-180.

[7] Obanil, R. (2006, October 3). Topmost floor of NSO Building Gutted by Fire. The Manila Bulletin Online. Retrieved 28 August 2007, from http://www.mb.com.ph/issues/2006/10/03/MTN2061037203.html

[8] Salvino, S. and Yu, A. C. Some Approaches in Dealing With Nonresponse in Survey Operations With Applications to the 1991 Marinduque Census of Agriculture and Fisheries Data (Undergraduate Thesis, De La Salle University) (1996).

[9] Siegel, S. (1988). Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill.

[10] No author. Statsoft Electronic Textbook. CHAID Analysis [Electronic version]. Retrieved 29 July 2007, from http://www.statsoft.com/textbook/stchaid.html

[11] StatSoft, Inc. (2005). STATISTICA (data analysis software system), version 7.1. www.statsoft.com.

Appendix
Appendix A Items and Information Gathered in the FIES 1997


Appendix B Source Codes of the Imputation Programs


Appendix C Model Validation of the Regression Equations used in the Regression Imputation Procedures