Professional Documents
Culture Documents
Andersson, Claes Statistics Sweden S-701 89 rebro, Sweden E-mail: claes.andersson@scb.se Andersson, Karin Statistics Sweden P.O. 24 300 S-104 51 Stockholm, Sweden E-mail: karin.andersson@scb.se Lundquist, Peter Statistics Sweden P.O. 24 300 S-104 51 Stockholm, Sweden E-mail: peter.lundquist@scb.se
Introduction
Repeated sample surveys are a rule rather than an exception within National Statistical Institutes (NSIs). It is of importance to estimate the value of different parameters, e.g. the number of unemployed, at the reference time point but also to get reliable estimates of change in the parameters between the reference times. A common design for this is to use rotating panel designs meaning that a part of the sample at time point 0 is retained at time point 1 and a complementary sample is taken at time point 1. The variance of an estimated difference between parameters at time 0 and 1 is a function of the variances at each time point and the covariance between the estimated parameters. The estimation of the variance of a difference between two totals based on partially overlapping samples is described in several papers (Qualit & Till 2008; Wood 2008; Berger 2004; Tam 1984). The case when the sampling rates are low and simple random sampling is used was treated by Kish (1965). In this paper we will treat the situation with a fix population and a rotating sample design that means fix sample sizes at both occasions and a fix overlapping rate. The assumption of a fix population is an approximation that is valid in many situations where individuals are sampled and the time lag is short between the reference times of the measurements, for example in a monthly or quarterly LFS. The focus will be on methods for variance estimation that can be used in practice in an NSI when auxiliary information is used in the estimation process and where the number of parameters is large and typically for domains that as a rule do not coincide with the stratification. Different methods of doing the variance estimation of the estimated changes are investigated; the relatively simple method used for LFS within Statistics Sweden is compared with two other methods.
tification is the same at time point 0 and 1, the remaining (overlapping) part s01 and the complement s1|0 are taken as srswor within strata. This procedure is a special case of the situation described in Berger (2004) and it is in accordance with the design for the Swedish LFS with overlapping fraction g=7/8 between two adjacent quarters. The first order inclusion probabilities in case described in this paper is 0k=nh/Nh, k h (stratum h), =gh 0k+qh(1- 0k), k h, where gh=n01h/n0h, qh=n1|0h/(Nh-n0h). The reason why we use stratum indices for g 1k and q is the fact that in practice we will get nonresponse and will condition on the number of respondents within each stratum.
n1h 1 N h ) s y 01h ,
where sy01h is the simple covariance between y0 and y1 calculated from the sample s01h. It was proposed by Kish (1965) to estimate the covariance from a combination of the correlation estimated from s01 and the variances estimated from s1 and s0,
C K (t1 , t 0 ) t:01 V1 (t1 )V0 (t 0 ) ; t:01 C01 (t1 , t 0 ) V01 (t1 )V01 (t 0 ) ,
where Vi ( ) is calculated from the sample si, i=0, 1 and V01 ( ) is calculated from the common part s01. Although Kish used the standard with replacement estimator, the same principle can be used for other types of estimators. If V (t0 ) V (t1 ) then the variance of may be approximated by AV ( ) 2V (t1 )(1 t:01 ) with an obvious simple estimator 2V (t1 )(1 t:01 ) . In most, if not all, surveys we will get nonresponse, it was noted by Qualit & Till (2008) that by conditioning on the respondents we can use the response sets instead of the sample sets in the formulas. However, the risk for nonrespons bias is always there and a positive bias in 01 based on the response set r01 is likely in practice as the correlation tends to be stronger among units responding at both occasions. It was shown by Berger (2004) that even a small positive bias in 01 implies a large negative bias in V ( ) when the correlation is large. It should be noted here that even if other parameters for measuring change than the difference between totals may be of interest, for example the ratio, the problem in the estimation of the covariance in a rotating design will be the same. Berger (2004) suggested an estimator of the covariance by assuming that the vector , n ) (t , t , n , n , n ) has a multivariate normal distribution under the Poisson sampling scheme with (t 1 0 1 0 01 covariance matrix,
tt nt
nt nn
where n
(n1 , n 0 , n 01 ) is the vector of sample sizes within each stratum for the samples s0, s1 and s01.
The covariance matrix of t is obtained by conditioning on the vector n using standard theory, 1 1 t n tt nt nn nt , which is estimated by, t n tt nt nn nt , see Berger (2004). The method is quite general and can handle situations where s0 does not involve srs and where s1 need not be taken within each stratum but can be taken independent of the stratification at time point 0. It also uses all information in the two samples rather than just the common part of the two samples. Berger (2004) suggests that the covariance is calculated by the same principle as C above, i.e.
K
01
is calculated from t n as defined above while the variances are estimated at time points 0 and 1 by condi-
tioning on n0 and n1 respectively, leading to the Hajek (1964) estimator of the variance. He also suggests a simple way to do the estimation of the covariance by using standard multiple regression software. Normally, a large number of differences are of interest for an NSI, each regarding a domain of study that not necessarily coincides with a stratum in the design. This means that the vector t normally is of length >2. Further, when auxiliary information is used in the estimation of t by regression estimators or calibration against know totals, Qualit & Till (2008) pointed out that the variables of interest in calculating the covariance are not y0 and y1 but the residuals e0 and e1. The residuals used in domain d may be defined as eidk yidk x ik id , i=0,1 where yidk=yik when unit k belongs to domain d, and yidk=0 otherwise. The auxiliary vector x does not have to be the same or to have the same values for unit k at both occasions.
The variance estimator of the estimated change in the study is V ( ) V (t1 ) V (t 0 ) 2C (t1 , t 0 ) , where the covariance is estimated by t:01 V1 (t1 )V0 (t 0 ) . The correlation t:01 between the estimated totals is
calculated in three different ways. In the old days when the regression estimator was not used and the calculation of standard errors was cumbersome the estimation of change was done in a rather naive way. The correlation y;01 between the yvalues at time points 0 and 1 was calculated, based on the common part s01, and not using the design, i.e. by treating s01 as generated by srs. This made sense to some extent since the allocation of the sample was proportional to stratum sizes the nonresponse rates were low and the HT-estimator was used. It was also observed that the changes in the calculated correlations were small over time meaning that a set of correlations or constants could be used over a longer period, typically several years. Our first estimator uses the correlation t;01 g y;01 .
The second estimator in our study uses the residuals that come from the GREG estimator at the two time points. The correlation is calculated using the residuals eik yik x ik i and using the design weights in the calculations,
e:01
with se:01
se:01
(
2 2 se:0 se:1 ,
s01 wk e0 k s01 wk e1k s01 wk )
s01 wk e0 k e1k
((
s01 wk )
1) ; wk
N h mh , k
h, Nh is the
stratum size and mh is the number of units in s01h. The residuals are calculated from s0 and s1, i.e. the whole samples are used at time 0 and 1. The second estimator uses t:01
g e:01 .
The third estimator is the one suggested by Berger (2004). In the calculation of t:01 the inversion of the matrix nn may be cumbersome when the number of strata is large, it is a 3H 3H symmetric matrix, where H is the number of strata. However, it turns out that the matrix contains a number of diagonal sub matrices which simplifies the storage and inversion by using the theory for partitioned matrices. The idea behind the second and third estimator is to find a better way to calculate the constants (correlations) to be used in the present production framework within Statistics Sweden.
Some results
In table 2 the results from the simulation study are shown. In column A is the variance of the differences calculated from the population, these are the theoretical values that the three different estimators are supposed to be estimators of. In columns B, E and H are the relative differences, R (V V 1)100 between the means of the esti mated variances, V for each estimator. It is obvious that the first, nave, estimator underestimate the variance, the bias is larger for g=7/8 than for g=4/8. This is not unexpected since the correlation between the yvalues is larger than between the e-values, while the variances for the totals are estimated by GREG. The differences between the second and third estimators are small in all cases. In columns D, G and J (CI95) are the coverage rates for nominal 95% confidence intervals calculated for each sample. As expected the first estimator gives too short intervals, although the coverage rates are not
that low, considering the underestimate of the variances. For the second and third estimator the coverage rates are close to the nominal value, 95%.
Table 1
Estimator g=7/8 Quarter Employed Not in LF Hours worked g=4/8 Year Employed Not in LF Hours worked 24 214 23 049 6 882 50.5 5 922 49.9 20 611 19 710 0.30 0.14 0.30 0.27 32 890 18 119 17 981 11 012 73.4 4 776 44.5 4 757 44.7 28 209 16 650 15 984 0.27 0.10 0.27 0.25 26 255 8 556 23 839 66.7 -20.2 -22.3 -15.5 -9.1 N=21 671 Unemployed N=39 131 Unemployed HT A
V (t0 )
GREG B
V (t1 )
C
t :01
D
V (t1 t 0 )
E
V (t0 )
F
V (t1 )
G
t :01
V (t1 t 0 ) (H/D-1)%
45 169 64 241 64 065 23 200 17 674 16 823 38 895 56 591 57 347 218.5 176.5 154.7
Table 2
Estimator A g=7/8 Quarter
V (t1 t 0 )
1 B C D E
2 F G H
3 I J
(1) (1) (2) t:01 CI95(2) R(3) t:01 CI95(3) t:01 CI95 R R 40 260 -20.8 0.75 92.0 -2.8 0.69 94.6 -3.1 0.70 94.7
21 375 -14.6 0.47 35 621 -21.0 0.75 212.4 -12.4 0.44 15 969 -23.5 0.66 7 555 -9.9 0.29 14 623 -23.8 0.66 43.7 -16.0 0.59 81 703 -10.7 0.43 27 527 268.2 -8.9 0.27 -7.7 0.25 72 519 -10.8 0.43
93.1 -0.8 0.39 93.1 -2.9 0.70 95.0 -2.6 0.38 91.7 -2.9 0.57 94.4 0.1 0.21 92.3 -2.6 0.56 93.5 -2.8 0.52 94.1 -5.6 0.40 93.1 -2.7 0.22 94.3 -5.7 0.40 93.8 -3.2 0.22 92.9 -7.3 0.33 94.8 -1.7 0.12 93.7 -7.1 0.32 93.7 -6.3 0.30
94.8 -1.6 0.39 95.3 -3.0 0.70 96.1 -1.3 0.37 95.1 -2.9 0.57 95.4 -0.5 0.21 95.8 -2.8 0.56 94.8 -2.4 0.52 94.6 -5.7 0.40 94.1 -3.2 0.23 94.8 -5.7 0.40 94.6 -2.6 0.21 94.0 -7.2 0.33 95.3 -2.0 0.12 94.4 -7.2 0.32 94.2 -6.2 0.30
94.9 95.3 96.2 95.1 95.3 95.8 94.8 94.7 93.9 94.9 94.6 93.9 95.3 94.4 94.3
Year
g=4/8 Quarter Employed Unemployed Not in LF Hours worked Year Employed Unemployed Not in LF Hours worked
26 255 -14.4 0.38 8 556 -6.8 0.16 23 839 -14.6 0.38 66.7 -11.3 0.34
Discussion
The way the first estimator is applied in the simulation study does not fully reflect the way it is used in the Swedish LFS. As mentioned before the correlation between the y-values are calculated for one year and then used for many years thereafter until it is assumed that new values are needed. To mimic that situation a fixed set of correlation should be calculated and then be used for all samples. The second estimator is a bit more advanced than the first in the sense that it takes the design into account as well as the fact that the GREG-estimator is used in the estimation of the totals at each time point. The estimator, e:01 is basically an estimator of the correlation between the residuals at time point 0 and 1 in the population and it is not obvious how it relates to the correlation between the estimated totals under the present design, however, it seems to work, at least in the cases shown here. A design based estimator of the correlation between the estimated totals would have been constructed in a different way. The third estimator takes both the design and the residuals into account and it has a theoretical basis, although some approximations are made. It is also more complicated to understand and maybe not very transparent for non-experts. However, a procedure based on a theory should be preferred to ad-hoc based procedures. The adaption of the second and third estimators in a large scale production system for official statistics is a bit complicated but can be solved. One obvious way is to calculate a set of correlation for one year and use that set for a number of years like it is done today in the Swedish LFS.
REFERENCES
Berger, Y. G. (2004). Variance Estimation of Measures of Change in Probability Sampling. Canadian Journal of Statistics, 32, pp. 451-467. Hajek, J. (1964). Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35, pp. 1491-1523. Kish, L. (1965). Survey Sampling: Wiley, New York. Qualit, L. and Till, Y. (2008). Variance Estimation of Changes in Repeated Surveys and its Application to the Swiss Survey of Value Added. Survey Methodology, 34, pp 173-181. Tam, S. M. (1984). On Covariances from overlapping Samples. The American Statistician, 38, pp. 288-289. Wood, J. (2008). On the Covariance Between Related Horvitz-Thompson Estimators. Journal of Official Statistics, 24, pp. 53-78.