
1.2 Statement of the Problem
This paper attempts to answer the following questions:
1. Which imputation technique is the most appropriate for handling partial nonresponse in the FIES data?
2. How do varying nonresponse rates affect the results of each imputation method?

1.3 Objectives of the Study
The paper will attempt to achieve the following objectives:
1. To compare four imputation techniques, namely overall mean imputation, hot deck imputation, deterministic regression imputation and stochastic regression imputation, in compensating for partial nonresponse in the FIES.
2. To investigate the effect of varying rates of missing observations, particularly nonresponse rates of 10%, 20% and 30%, on the precision of the estimates.

1.4 Significance of the Study
Nonresponse is a common problem in conducting surveys. It produces incomplete data, which can pose serious problems during data analysis, particularly in the generation of statistically reliable estimates. Imputation techniques make it possible to account for the difference between respondents and nonrespondents, which in turn helps reduce nonresponse bias in the survey estimates.

Since most statistical packages require the use of complete data before conducting any procedure for data analysis, the use of imputation techniques can ensure consistency of results across analyses, something that an incomplete data set cannot fully provide.

In a news article entitled "Topmost Floor of the NSO Building gutted by Fire," posted on Manila Bulletin Online, Obanil (2006) reported that on October 3, 2006, documents worth around 1 million pesos were destroyed by fire. Among the documents destroyed were the first-visit questionnaires of the FIES for the NCR, which had not yet been encoded at the time of the fire.

In terms of statistical research, most countries in the developed world, such as the United States, Canada, the UK and the Netherlands, already employ imputation techniques in their respective national statistical offices. In a country such as the Philippines, where data collection is very difficult, especially in some regions like the National Capital Region (NCR), imputation can ease the problems of data collection and nonresponse.

More importantly, given the great impact of this survey on the country, imputation techniques will give statisticians a method for handling nonresponse, which could lead to more meaningful generalizations about the country's income distribution, spending patterns and poverty incidence. Estimates with less bias and more consistent results can, in turn, help policymakers and economists provide better solutions for improving the lives of Filipinos.

1.5 Scope and Limitations
Throughout this paper, only the data from the 1997 Family Income and Expenditure Survey (FIES) will be used to tackle the problem of nonresponse and to examine the impact of the different imputation methods applied to the dataset. The imputation methods will be applied and evaluated only for partial nonresponse occurring in the National Capital Region (NCR), since NCR is noted as the region with the highest nonresponse rate. The variables to be imputed are the Total Income (TOTIN2) and Total Expenditures (TOTEX2) from the second visit of the FIES.

The researchers will use only the first-visit data of the 1997 FIES to impute the partial nonresponse present in the second visit. This paper also assumes that the first-visit data are complete and that the pattern of nonresponse is Missing Completely at Random (MCAR). Data are MCAR if the probability of response to Y is unrelated to the value of Y itself or to any other variable, so that the missing data are randomly distributed across all cases (Musil et al., 2002). If the pattern of nonresponse does not satisfy the MCAR assumption, the imputation methods may not achieve their purpose.
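The MCAR assumption can be illustrated by deleting a fixed share of complete records purely at random, so that deletion depends on neither the value itself nor any other variable. The sketch below is only an illustration under stated assumptions: the function name `simulate_mcar`, the seed, and the placeholder values are hypothetical, and the record count of 4130 is taken from the retained and deleted counts reported in the source's tables, not from a stated procedure.

```python
import random


def simulate_mcar(values, rate, seed=0):
    """Split record indices into (retained, deleted) by simple random deletion.

    Every record has the same chance of being deleted, regardless of its
    value, which is what the MCAR assumption requires.
    """
    rng = random.Random(seed)
    n = len(values)
    m = round(n * rate)  # number of records set to nonresponse
    deleted = sorted(rng.sample(range(n), m))
    deleted_set = set(deleted)
    retained = [i for i in range(n) if i not in deleted_set]
    return retained, deleted


# Placeholder values standing in for a complete second-visit variable.
values = [50000 + 10 * i for i in range(4130)]

for rate in (0.10, 0.20, 0.30):
    retained, deleted = simulate_mcar(values, rate, seed=1997)
    print(f"{rate:.0%}: retained={len(retained)}, deleted={len(deleted)}")
```

With 4130 records, rates of 10%, 20% and 30% delete 413, 826 and 1239 records respectively.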

As for the imputation techniques, only four methods will be applied in this paper, namely Overall Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Regression Imputation (DRI) and Stochastic Regression Imputation (SRI). Other methods of handling nonresponse will not be covered.
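To make the four methods concrete, the sketch below shows one common textbook variant of each. It is not the paper's implementation. All function names and the toy data are hypothetical. The hot deck variant draws a random donor within an imputation class, the regression uses a single first-visit predictor, and the stochastic variant adds a randomly drawn respondent residual. These are assumptions, since this section does not specify those details.

```python
import random
import statistics


def overall_mean_imputation(y):
    """OMI: fill each missing value (None) with the mean of all respondents."""
    respondent_mean = statistics.fmean([v for v in y if v is not None])
    return [respondent_mean if v is None else v for v in y]


def hot_deck_imputation(y, classes, rng):
    """HDI: fill each missing value with a random donor from the same class.

    Assumes every class containing a missing value has at least one donor.
    """
    donors = {}
    for value, cls in zip(y, classes):
        if value is not None:
            donors.setdefault(cls, []).append(value)
    return [rng.choice(donors[cls]) if value is None else value
            for value, cls in zip(y, classes)]


def fit_line(x, y):
    """Least-squares intercept, slope and residuals, fitted on respondents."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    mean_x = statistics.fmean([a for a, _ in pairs])
    mean_y = statistics.fmean([b for _, b in pairs])
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in pairs]
    return intercept, slope, residuals


def regression_imputation(x, y, rng=None):
    """DRI when rng is None; SRI when rng is given (adds a random residual)."""
    intercept, slope, residuals = fit_line(x, y)
    imputed = []
    for a, b in zip(x, y):
        if b is None:
            b = intercept + slope * a
            if rng is not None:
                b += rng.choice(residuals)
        imputed.append(b)
    return imputed


# Toy data: first-visit values are complete, second-visit values have gaps.
first_visit = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
second_visit = [12.0, None, 33.0, 41.0, None, 62.0]
classes = ["IC1", "IC1", "IC1", "IC2", "IC2", "IC2"]
rng = random.Random(0)

omi = overall_mean_imputation(second_visit)
hdi = hot_deck_imputation(second_visit, classes, rng)
dri = regression_imputation(first_visit, second_visit)
sri = regression_imputation(first_visit, second_visit, rng)
```

The deterministic variant places every imputed value exactly on the fitted line, which tends to understate variability. The stochastic variant restores some of that spread by adding a residual.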

The evaluation of the efficacy and appropriateness of the four imputation methods will be limited to the following: (a) the bias of the mean of the imputed data, (b) an assessment of the distributions of the imputed versus the actual data and (c) the criteria mentioned in the report entitled Compensating for Missing Data (Kalton, 1983), namely the Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation.

5.2 Formation of Imputation Classes

PROV (Provincial Area Codes)
Classes   Scope
39        Manila
74        Quezon City, Mandaluyong City, San Juan, Marikina, Pasig City
75        Caloocan, Malabon, Navotas, Valenzuela
76        Makati, Las Pinas, Muntinlupa, Paranaque, Pasay, Taguig, Pateros

CODEP1 (Recoded Total Employed Household Members)
Classes   Scope
0         No employed members
1         One to two employed members
2         Three to four employed members
3         At least five employed members
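The CODEP1 recoding above maps a count of employed household members to one of four classes. A minimal sketch of that mapping follows. The function name `recode_employed` and the idea that the input is a raw integer count are assumptions for illustration, since the name of the underlying FIES variable is not given here.

```python
def recode_employed(n_employed):
    """Map a count of employed household members to a CODEP1 class (0 to 3)."""
    if n_employed < 0:
        raise ValueError("count of employed members cannot be negative")
    if n_employed == 0:
        return 0  # no employed members
    if n_employed <= 2:
        return 1  # one to two employed members
    if n_employed <= 4:
        return 2  # three to four employed members
    return 3      # at least five employed members


# Hypothetical household counts and their resulting classes.
counts = [0, 1, 2, 3, 4, 5, 8]
codes = [recode_employed(c) for c in counts]
```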

CODES1 (Recoded Education Status)
Classes   Scope
1         No grade completed until High School Graduate
2         College undergraduate or college graduate
3         Educational attainment higher than a bachelor's degree

Table 3:

Candidate MV   Phi Coefficient     Cramer's V          Contingency Coefficient
               CODIN1    CODEX1    CODIN1    CODEX1    CODIN1    CODEX1
PROV           0.192     0.183     0.111     0.105     0.188     0.18
CODES1         0.386     0.408     0.273     0.288     0.36      0.378
CODEP1         0.295     0.216     0.17      0.125     0.283     0.211

Note: ….. CODIN1 stands for coded income for the first visit while CODEX1 stands for coded expenditure for the first visit.
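The three association measures in Table 3 are all functions of the chi-square statistic of a two-way table of counts. The sketch below uses the standard textbook definitions, which may differ in detail from the software the authors used. The function name `association_measures` and the example counts are hypothetical, and the code assumes no row or column total is zero.

```python
import math


def association_measures(table):
    """Return (phi, cramers_v, contingency_c) for an r x c table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-square: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0])) - 1  # min(r - 1, c - 1)
    phi = math.sqrt(chi2 / n)
    cramers_v = math.sqrt(chi2 / (n * k))
    contingency_c = math.sqrt(chi2 / (chi2 + n))
    return phi, cramers_v, contingency_c


# Hypothetical 3 x 3 table of counts (rows: class of one variable,
# columns: class of another).
example = [[30, 15, 5],
           [10, 25, 15],
           [5, 10, 35]]
phi, v, c = association_measures(example)
```

Under these definitions V equals phi divided by the square root of min(r - 1, c - 1), and C equals phi divided by the square root of 1 + phi². The reported values appear consistent with this: for CODES1 and CODIN1, assuming min(r - 1, c - 1) = 2, 0.386 / sqrt(2) ≈ 0.273 and 0.386 / sqrt(1 + 0.386²) ≈ 0.360.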

Descriptive Statistics

VI       IC    Mean        Minimum     Maximum     Std. Dev    Valid n
TOTIN2   IC1   93588.32    9067.000    1340900     75619.52    2635
TOTIN2   IC2   186940.9    14490.00    4215480     281852.3    1434
TOTIN2   IC3   643191.2    54790.00    4357180     829409.3    61
TOTEX2   IC1   74866.68    9025.000    731937.0    47517.69    2635
TOTEX2   IC2   135510.8    13575.00    3203978     151984.3    1434
TOTEX2   IC3   413184.0    40505.00    2726603     532577.1    61

VI       NRR   Observations retained    Observations set to nonresponse (deleted)
               n      Mean              n      Mean
TOTEX2   10%   3717   102748.610        413    99160.235
TOTEX2   20%   3304   102219.791        826    103069.697
TOTEX2   30%   2891   100709.947        1239   106309.365
TOTIN2   10%   3717   134821.662        413    127799.121
TOTIN2   20%   3304   133624.722        826    136098.155
TOTIN2   30%   2891   130685.596        1239   142131.636

Table 9:

(a) VI   (b) NRR   (c) BIAS(y')   (d) PCD    (e) MD       (f) MAD      (g) RMSD
TOTEX2   10%       491.91         100.00%    4919.40      78071.61     79251.22
TOTEX2   20%       179.42         96.90%     897.18       78292.63     67149.16
TOTEX2   30%       -606.37        0.00%      -2021.19     81395.79     71390.65
TOTIN2   10%       -717.52        100.00%    -7175.25     105369.15    242022.99
TOTIN2   20%       -3095.41       100.00%    -15477.09    111748.04    297151.50
TOTIN2   30%       -6508.65       1.00%      -21695.52    115087.13    313814.92

Table 10:

(a) VI   (b) NRR   (c) BIAS(y')   (d) PCD    (e) MD       (f) MAD     (g) RMSD
TOTEX2   10%       -720.46        100.00%    -7204.56     23839.82    57726.62
TOTEX2   20%       -1469.57       100.00%    -7347.86     23231.65    53180.02
TOTEX2   30%       -2266.38       100.00%    -7554.61     24082.88    59795.67
TOTIN2   10%       -1128.45       100.00%    -11284.46    32115.80    77228.48
TOTIN2   20%       -2211.82       100.00%    -11059.09    35274.03    114957.43
TOTIN2   30%       -4137.78       100.00%    -13792.60    34537.36    103253.12

Table 11:

(a) VI   (b) NRR   (c) BIAS(y')   (d) PCD    (e) MD      (f) MAD     (g) RMSD
TOTEX2   10%       536.32         100.00%    5363.47     33683.48    70553.64
TOTEX2   20%       1080.12        98.40%     5400.71     33782.60    72487.39
TOTEX2   30%       398.39         100.00%    1328.06     32449.49    72803.60
TOTIN2   10%       897.11         100.00%    9043.98     51363.17    106374.39
TOTIN2   20%       -1815.39       100.00%    -9076.98    57429.24    148278.49
TOTIN2   30%       356.50         100.00%    1188.31     51886.73    131429.61
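The MD, MAD and RMSD columns in Tables 9 to 11 correspond to the Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation criteria attributed to Kalton (1983). The sketch below computes them over the m imputed cases. It assumes each deviation is taken as imputed minus actual, and that BIAS(y') is the difference between the mean of the imputed data set and the mean of the actual data set over all n cases. Neither definition is spelled out in this section, so both are labeled assumptions. The function names and toy values are hypothetical.

```python
import math


def deviation_criteria(actual, imputed):
    """Return (MD, MAD, RMSD) over the imputed cases.

    `actual` and `imputed` hold the true and imputed values for the m
    records that were set to nonresponse. Deviations are imputed - actual
    (an assumed sign convention).
    """
    m = len(actual)
    deviations = [i - a for a, i in zip(actual, imputed)]
    md = sum(deviations) / m
    mad = sum(abs(d) for d in deviations) / m
    rmsd = math.sqrt(sum(d * d for d in deviations) / m)
    return md, mad, rmsd


def bias_of_mean(md, m, n):
    """Bias of the overall mean when only m of n values were imputed.

    Retained values contribute zero deviation, so the difference between
    the imputed-data mean and the actual-data mean equals (m / n) * MD.
    """
    return md * m / n


# Toy values for three deleted records out of a data set of thirty.
actual = [10.0, 20.0, 30.0]
imputed = [12.0, 18.0, 33.0]
md, mad, rmsd = deviation_criteria(actual, imputed)
bias = bias_of_mean(md, m=3, n=30)
```

Under these assumptions the bias is the MD scaled by the nonresponse rate, which is roughly the pattern in the reported columns (for example, a bias of 491.91 against an MD of 4919.40 at a 10% rate).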

Figure 5:

[Chart not reproduced. Vertical axis: percentage, 0.00% to 100.00%. Series: TV, OMI, HDI3*, DRI3, SRI3*. Categories: <37869.5; 37869.5 - 47056.5; 47056.5 - 54922.0; 54922.0 - 62365.0; 63265.0 - 73868.0; 73868.0 - 86103.0; 86103.0 - 101947.0; 101947.0 - 126254.5; 126254.5 - 169964.0; >169964]