Pseudo Eblup 3
SUSANA RUBIN-BLEUER
LEON JANG
SUSANA RUBIN-BLEUER is Adjunct Research Professor at Carleton University and Senior Survey
Methodologist at Statistics Canada. SERGE GODBOUT and LEON JANG are Senior Survey
Methodologists at Statistics Canada.
The authors would like to thank J. N. K. Rao for his advice and encouragement, and
Victor Estevao from Statistics Canada for his support in our use of the Statistics Canada Small
Area System.
This work was partially supported by Statistics Canada.
*Address correspondence to Susana Rubin-Bleuer, Statistics Canada, 16 RHC Building, 100
Tunney’s Pasture Driveway, Ottawa, Ontario, Canada K1A 0T6; E-mail: susana.rubin-bleuer@canada.ca.
doi: 10.1093/jssam/smw013
© The Author 2016. Published by Oxford University Press on behalf of the American Association for Public Opinion Research.
All rights reserved. For Permissions, please email: journals.permissions@oup.com
2 Rubin-Bleuer, Jang, and Godbout
$$\bar{Y}_{iC} = \sum_{k=1}^{N_i} C_{ik}\, y_{ik} \Big/ \sum_{k=1}^{N_i} C_{ik}, \qquad C_{ik} > 0. \tag{1.1}$$
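As a small illustration of the weighted mean in (1.1), the following sketch computes the parameter for one domain. The function name and the numbers are hypothetical and are not taken from the SEPH data; the sketch only assumes that each unit carries a value y_ik and a positive weight C_ik.

```python
def weighted_domain_mean(y, c):
    """Return sum_k C_ik * y_ik / sum_k C_ik for one domain, as in (1.1)."""
    if len(y) != len(c):
        raise ValueError("y and c must have the same length")
    if any(w <= 0 for w in c):
        raise ValueError("every C_ik must be strictly positive")
    return sum(ci * yi for ci, yi in zip(c, y)) / sum(c)


# Hypothetical domain with three establishments (illustrative values only).
earnings = [800.0, 1000.0, 1200.0]   # y_ik
employees = [10, 30, 60]             # C_ik
domain_mean = weighted_domain_mean(earnings, employees)
```

Units with larger C_ik pull the mean toward their own value, which is why the weighted mean here exceeds the simple average of the three earnings values.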
We identified domains and/or units where the model failed, and investigated
whether model-based MSE estimates can be used to estimate the design-based MSE for
the SEPH data.
In section 2, we describe the Canadian SEPH data, the regression model
used for direct estimation, and the simulation process to create the synthetic
“SEPH” population. In section 3, we define the EBLUP and pseudo-EBLUP
estimators of a weighted mean under a nested error regression model with
heteroscedastic errors. We give an approximation of the MSE of the pseudo-
equal in both the PD7 and survey files. Two other variables were derived: average
weekly earnings ($AWE_{ik} = SWE_{ik}/C_{ik}$) and average monthly earnings
($AME_{ik} = P_{ik}/C_{ik}$).
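$$\hat{\bar{Y}}^{G}_{iC} = \sum_{k=1}^{N_i} C_{ik}\,\hat{y}_{ik} \Big/ \sum_{k=1}^{N_i} C_{ik} \;+\; \sum_{k=1}^{n_i} \tilde{w}_{ik}\, C_{ik}\,(y_{ik} - \hat{y}_{ik}) \Big/ \sum_{k=1}^{N_i} C_{ik},$$
$$\hat{\beta}^{G} = \Big( \sum_{i=1}^{m} \sum_{k=1}^{n_i} \tilde{w}_{ik}\, C_{ik}\, x_{ik} x_{ik}' \Big)^{-1} \sum_{i=1}^{m} \sum_{k=1}^{n_i} \tilde{w}_{ik}\, C_{ik}\, x_{ik}\, y_{ik}.$$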
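The two GREG formulas above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the production implementation: the covariate x_ik is taken to be a scalar (so the matrix inverse becomes a division), the predicted value is assumed to be x_ik times the estimated coefficient, and the function names and data are hypothetical.

```python
def greg_beta(samples):
    """Weighted least-squares coefficient for a scalar covariate.

    samples: list over domains; each domain is a list of (w, C, x, y) tuples,
    where w is the survey weight and C the size measure of a sampled unit.
    """
    num = sum(w * c * x * y for dom in samples for (w, c, x, y) in dom)
    den = sum(w * c * x * x for dom in samples for (w, c, x, y) in dom)
    return num / den


def greg_domain_mean(population, sample, beta):
    """GREG estimate of the weighted mean for one domain.

    population: list of (C, x) for all N_i units of the domain.
    sample: list of (w, C, x, y) for the n_i sampled units of the domain.
    Assumes predicted values y_hat = x * beta (an assumption of this sketch).
    """
    total_c = sum(c for c, _ in population)
    synthetic = sum(c * x * beta for c, x in population) / total_c
    correction = sum(w * c * (y - x * beta) for (w, c, x, y) in sample) / total_c
    return synthetic + correction


# Hypothetical data in which y = 2x exactly, so the residual correction is zero.
pop_domain = [(1, 1.0), (2, 2.0), (1, 3.0)]
smp_domain = [(2.0, 1, 1.0, 2.0), (2.0, 1, 3.0, 6.0)]
beta_hat = greg_beta([smp_domain])
greg_est = greg_domain_mean(pop_domain, smp_domain, beta_hat)
```

The first term is a synthetic prediction over the whole domain; the second is a weighted sum of sample residuals that corrects the prediction when the model does not fit the domain.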
The SEPH population was built as follows. First, the monthly administrative
source was used to list all the establishments in the population with their
respective number of employees and average monthly earnings. Table 2.1 reports
population and sample sizes for four key industries or model groups (MGs) in
December.
For respondent sampled units, $y_{ik}$ was available and linked to the administrative
source. For nonrespondents in the sample, $y_{ik}$ was imputed from historical
monthly data because it was assumed to be stable over short periods of time. For
nonsampled units, $y_{ik}$ was imputed by the nearest-neighbor method using $x_{ik}$
and $C_{ik}$ while preserving the cross-sectional correlations between $y_{ik}$ and $x_{ik}$
within industry groups. This produced a census with $x_{ik}$ and $C_{ik}$ values obtained
from administrative sources and “imputed” $y_{ik}$. Twelve monthly populations,
from January to December, were created independently of each other.
Note that correlations between $y_{ik}$ and $x_{ik}$ differ because of the inclusion of
bonus and other types of payments in some of the weeks.
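The nearest-neighbor step for nonsampled units can be sketched as follows. The distance function is an assumption of this sketch (plain Euclidean distance on the raw pair of x and C values); the text does not state which distance was used or how the donor pool was restricted beyond the industry group, and the function name and data are hypothetical.

```python
import math


def nearest_neighbor_impute(donors, recipients):
    """Impute y for each recipient from its closest donor.

    donors: list of (x, C, y) for units with an observed or imputed y,
            all from the same industry group.
    recipients: list of (x, C) for nonsampled units in that group.
    Returns the list of imputed y values, in recipient order.
    """
    imputed = []
    for x, c in recipients:
        # Closest donor in the (x, C) plane; ties go to the first donor listed.
        donor = min(donors, key=lambda d: math.hypot(d[0] - x, d[1] - c))
        imputed.append(donor[2])
    return imputed


# Hypothetical donors and recipients within one industry group.
donor_units = [(100.0, 5, 110.0), (500.0, 50, 520.0)]
nonsampled_units = [(120.0, 6), (480.0, 45)]
imputed_y = nearest_neighbor_impute(donor_units, nonsampled_units)
```

In practice the two matching variables would usually be put on a common scale before computing distances, since x and C are measured in different units.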
To check that the cross-sectional relationships were preserved, we calculated
the unweighted correlations in the original sample and the correlations in the
Pseudo-EBLUP Estimation 5
Table 2.2. Correlation $\rho(y_{ik}, x_{ik})$ in the Sample and in the Nonsampled
Population
1 2 3 4 1 2 3 4
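if both the sampling rate $n_i/N_i$ and the employment rate $c_i = \sum_{k=1}^{n_i} C_{ik} \big/ \sum_{k=1}^{N_i} C_{ik}$
are “negligible,” and if $n_i/N_i$ or $c_i$ are not negligible,
$$\hat{\bar{Y}}^{E}_{iC} = c_i\, \bar{y}_{iC} + (1 - c_i)\,\{ \bar{x}^{*\prime}_{iC}\, \hat{\beta}^{E} + \hat{\gamma}^{E}_{iC}\,( \bar{y}_{iC} - \bar{x}_{iC}'\, \hat{\beta}^{E} ) \}, \tag{3.3}$$
with $\bar{x}^{*}_{iC} = \sum_{k \notin s_i} C_{ik}\, x_{ik} \big/ \sum_{k \notin s_i} C_{ik}$; $s_i$ = sample in domain $i$.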
Remark 3.1.
For this study, we use the well-known method of fitting constants to estimate
the variance components. Under model (3.1), $\hat{\sigma}^2$ is model-unbiased and
$\hat{\sigma}_v^2 = \max(\tilde{\sigma}_v^2, 0)$ is model-consistent (see Rao 2003, p. 138, for details). Note
that maximum likelihood estimators of the variance components are also
model-consistent under the heteroscedastic model (Jiang and Nguyen 2012).
Both methods of variance estimation assume that there is no selection bias, and
they do not use sampling weights. We will see later that even if this assumption
is not valid, YR, calculated with unweighted variance estimates, seems to be
robust against stratification effects.
Remark 3.2.
Note that we defined YR as a design-consistent estimator of
$\bar{Y}_{iC}$ as $n_i \to \infty$, and such that $\bar{x}_{iCw}$ and $\bar{y}_{iCw}$ are design-consistent estimators
of $\bar{X}_{iC}$ and $\bar{Y}_{iC}$, respectively. In addition, YR coincides with the EBLUP esti-
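if sampling rates are negligible, and if sampling rates are not negligible, is
given by:
$$\hat{\bar{Y}}^{PR}_{iC} = c_i\, \bar{y}_{iC} + (1 - c_i)\,\{ \bar{x}^{*\prime}_{iC}\, \hat{\beta}^{PR} + \hat{\gamma}_{iCw}\,( \bar{y}_{iCw} - \bar{x}_{iCw}'\, \hat{\beta}^{PR} ) \}, \tag{3.7}$$
with $\hat{\gamma}_{iCw}$ as in (3.4) and $\hat{\beta}^{PR} = \big( \sum_{i=1}^{m} \hat{\gamma}_{iCw}\, \bar{x}_{iCw}\, \bar{x}_{iCw}' \big)^{-1} \sum_{i=1}^{m} \hat{\gamma}_{iCw}\, \bar{x}_{iCw}\, \bar{y}_{iCw}$ (see
also Rubin-Bleuer et al. 2007a).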
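To make the structure of (3.7) concrete, the sketch below evaluates the PR regression coefficient and the PR estimate for one domain. It is illustrative only: the covariate is assumed to be scalar, the shrinkage factors are taken as given inputs (their definition in (3.4) is not reproduced here), and all names and numbers are hypothetical.

```python
def beta_pr(gamma, xbar_w, ybar_w):
    """Pseudo-EBLUP coefficient from domain-level weighted means (scalar x).

    gamma, xbar_w, ybar_w: lists over domains of the estimated shrinkage
    factor, the weighted sample mean of x, and the weighted sample mean of y.
    """
    num = sum(g * x * y for g, x, y in zip(gamma, xbar_w, ybar_w))
    den = sum(g * x * x for g, x in zip(gamma, xbar_w))
    return num / den


def pr_estimate(c_i, ybar_sample, xbar_nonsample, gamma_i, xbar_w_i, ybar_w_i, beta):
    """PR estimate for one domain when sampling rates are not negligible.

    c_i: share of the domain total of C_ik held by sampled units.
    ybar_sample: C-weighted mean of y over the sampled units.
    xbar_nonsample: C-weighted mean of x over the nonsampled units.
    """
    predicted = xbar_nonsample * beta + gamma_i * (ybar_w_i - xbar_w_i * beta)
    return c_i * ybar_sample + (1.0 - c_i) * predicted


# Hypothetical two-domain example.
b_pr = beta_pr([0.5, 0.5], [1.0, 2.0], [2.0, 4.0])
y_pr = pr_estimate(0.2, 3.0, 1.5, 0.5, 1.0, 2.4, b_pr)
```

The estimate is a convex combination: the observed part of the domain enters through its own mean with weight c_i, and the unobserved part is predicted from the model, shrunk toward the regression line by the factor gamma.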
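where $g_{1iCw} = (1 - \gamma_{iCw})\, \sigma_v^2$,
$$g_{3iCw} = \frac{\delta^{4}_{iCw}}{(\sigma_v^2 + \delta^{2}_{iCw}\, \sigma^2)^{3}} \big\{ \sigma^4\, \mathrm{var}(\hat{\sigma}_v^2) - 2 \sigma_v^2 \sigma^2\, \mathrm{cov}(\hat{\sigma}_v^2, \hat{\sigma}^2) + \sigma_v^4\, \mathrm{var}(\hat{\sigma}^2) \big\}$$
(see supplementary materials online) and
$$g_{2iCw} = (\bar{X}_{iC} - \gamma_{iCw}\, \bar{x}_{iCw})'\, \mathrm{Var}\{\hat{\beta}^{YR}\}\, (\bar{X}_{iC} - \gamma_{iCw}\, \bar{x}_{iCw}),$$
$$\mathrm{Var}\{\hat{\beta}^{YR}\} = \Big( \sum_{i=1}^{m} \sum_{k=1}^{n_i} x_{ik} z_{ik}' \Big)^{-1} \mathrm{Var}\Big( \sum_{i=1}^{m} \sum_{k=1}^{n_i} z_{ik} y_{ik} \Big) \Big( \sum_{i=1}^{m} \sum_{k=1}^{n_i} z_{ik} x_{ik}' \Big)^{-1},$$
with $z_{ik} = \tilde{w}_{ik}\, C_{ik}\, (x_{ik} - \gamma_{iCw}\, \bar{x}_{iCw})$ and
$$\mathrm{Var}\Big( \sum_{i=1}^{m} \sum_{k=1}^{n_i} z_{ik} y_{ik} \Big) = \sigma_v^2 \sum_{i=1}^{m} \Big( \sum_{k=1}^{n_i} z_{ik} \Big) \Big( \sum_{k=1}^{n_i} z_{ik} \Big)' + \sigma^2 \sum_{i=1}^{m} \sum_{k=1}^{n_i} z_{ik} z_{ik}' / C_{ik}.$$
A second-order unbiased estimator of $\mathrm{MSE}(\hat{\bar{Y}}^{YR}_{iC})$ is given by:
$$\mathrm{mse}(\hat{\bar{Y}}^{YR}_{iC}) = g_{1iCw}(\hat{\sigma}_v^2, \hat{\sigma}^2) + g_{2iCw}(\hat{\sigma}_v^2, \hat{\sigma}^2) + 2\, g_{3iCw}(\hat{\sigma}_v^2, \hat{\sigma}^2) \tag{3.9}$$
in the case of sampling rates $n_i/N_i$ and $c_i$ are “negligible,” and otherwise by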
The term $g^{*}_{2iCw}$ is obtained from $g_{2iCw}$ by changing $\bar{X}_{iC}$ to $\bar{x}^{*}_{iC}$. See the
supplementary materials online for the theoretical proof and a simulation that
shows $\mathrm{mse}(\hat{\bar{Y}}^{YR}_{iC})$ is second-order unbiased.
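The way the three terms combine in (3.9) can be illustrated numerically. The sketch below is a simplification under explicit assumptions: the shrinkage factor is taken to be the ratio of the area-effect variance to the sum of that variance and delta-squared times the error variance (the usual nested-error form, assumed here because (3.4) is not reproduced in this section), the g2 term is passed in as a precomputed number, and the variances and covariance of the variance-component estimators are supplied by the user. Function names and values are hypothetical.

```python
def gamma_w(sigma_v2, sigma2, delta2):
    """Shrinkage factor; the functional form is an assumption of this sketch."""
    return sigma_v2 / (sigma_v2 + delta2 * sigma2)


def g1(sigma_v2, sigma2, delta2):
    """Leading term: (1 - gamma) * sigma_v^2."""
    return (1.0 - gamma_w(sigma_v2, sigma2, delta2)) * sigma_v2


def g3(sigma_v2, sigma2, delta2, var_v, var_e, cov_ve):
    """Term due to estimating the variance components.

    var_v, var_e, cov_ve: variance of the area-effect variance estimator,
    variance of the error variance estimator, and their covariance.
    """
    a = sigma_v2 + delta2 * sigma2
    h = sigma2 ** 2 * var_v - 2.0 * sigma_v2 * sigma2 * cov_ve + sigma_v2 ** 2 * var_e
    return delta2 ** 2 / a ** 3 * h


def mse_negligible(g1_value, g2_value, g3_value):
    """MSE estimator for negligible sampling rates: g1 + g2 + 2 * g3."""
    return g1_value + g2_value + 2.0 * g3_value


# Hypothetical inputs, with g2 supplied directly as 0.01.
g1_val = g1(1.0, 1.0, 1.0)
g3_val = g3(1.0, 1.0, 1.0, 0.1, 0.1, 0.0)
mse_val = mse_negligible(g1_val, 0.01, g3_val)
```

The factor 2 on the last term reflects the usual bias correction: plugging estimated variance components into the leading term underestimates it by approximately the size of the third term.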
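$$\mathrm{ARB}\big(\hat{\bar{Y}}_{iC}\big) = \frac{\Big| \frac{1}{1000} \sum_{s=1}^{1000} \hat{\bar{Y}}^{(s)}_{iC} - \bar{Y}_{iC} \Big|}{\bar{Y}_{iC}} \quad \text{and} \quad \mathrm{RRMSE}\big(\hat{\bar{Y}}_{iC}\big) = \frac{\sqrt{ \frac{1}{1000} \sum_{s=1}^{1000} \big( \hat{\bar{Y}}^{(s)}_{iC} - \bar{Y}_{iC} \big)^2 }}{\bar{Y}_{iC}}. \tag{4.1}$$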
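The two design-based performance measures can be computed from replicate estimates as in the sketch below. It assumes the usual Monte Carlo definitions: absolute relative bias is the absolute difference between the average estimate over samples and the true domain value, divided by the true value; relative root MSE is the root of the average squared error divided by the true value. The function names and the replicate values are hypothetical.

```python
import math


def absolute_relative_bias(estimates, truth):
    """|mean of estimates over samples - truth| / truth."""
    mean_estimate = sum(estimates) / len(estimates)
    return abs(mean_estimate - truth) / truth


def relative_root_mse(estimates, truth):
    """sqrt(mean squared error over samples) / truth."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / truth


# Hypothetical replicate estimates for one domain (4 samples instead of 1000).
replicate_estimates = [90.0, 110.0, 100.0, 104.0]
true_value = 100.0
arb_value = absolute_relative_bias(replicate_estimates, true_value)
rrmse_value = relative_root_mse(replicate_estimates, true_value)
```

An estimator can have a small bias and still a large relative root MSE, as in this toy example, because the second measure also captures sampling variability.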
Figure 5.1. Absolute Relative Bias and Relative Root MSE, December (horizontal axis of each panel: SYN, GREG, PR, YR, EBLUP)
$$\text{Model RRMSE} = \frac{1}{1000} \sum_{s=1}^{1000} \frac{\sqrt{\mathrm{mse}\big(\hat{\bar{Y}}^{YR(s)}_{iC}\big)}}{\hat{\bar{Y}}^{YR(s)}_{iC}}, \qquad i = 1, \ldots, m. \tag{4.2}$$
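The model RRMSE in (4.2) averages, over the simulated samples, the ratio of the root of the estimated MSE to the estimate itself. A minimal sketch, with hypothetical names and only two replicate samples in place of 1000:

```python
import math


def model_rrmse(mse_estimates, point_estimates):
    """Average over samples of sqrt(mse estimate) / point estimate."""
    if len(mse_estimates) != len(point_estimates):
        raise ValueError("need one MSE estimate per point estimate")
    ratios = [math.sqrt(m) / e for m, e in zip(mse_estimates, point_estimates)]
    return sum(ratios) / len(ratios)


# Hypothetical values for one domain.
model_rrmse_value = model_rrmse([4.0, 9.0], [100.0, 150.0])
```

Unlike the design-based RRMSE, this quantity does not use the true domain value, so it can be compared against the design-based measure to judge how well the model-based MSE estimates track the design-based MSE.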
relative bias and design root relative mean squared error in December, for all
industries and estimators. SYN is the most biased, whereas GREG is the least
biased across industries. The estimated signal-to-noise ratios $\hat{\sigma}_v^2 / \hat{\sigma}^2$ were 1.4%
(Industry 1), 1% (Industry 2), 0.1% (Industry 3), and 0.4% (Industry 4). There
seemed to be a weak association between the bias of the synthetic estimator
and the signal-to-noise ratio.
The sampling bias of the PR, YR, and EBLUP estimators was relatively low.
Figure 5.2 highlights their differences. Some domains are entirely contained
within strata and hence self-weighted, but other domains cut across strata and
are affected by selection bias. EBLUP displays less bias in the former domains,
while PR and YR exhibit less bias in the latter domains: they appear to be
robust against the effects of size stratification.
In terms of RRMSE, Figure 5.1 shows that the direct survey estimator
GREG performs the worst. This was expected, as the sample sizes were very
small, generating large sampling errors in GREG. The reduction in RRMSE of
the small area estimators is considerable even when the $y$-$x$ correlation is
low and the area effects are weak. Differences were dramatic when we compared
the RRMSE of PR with the CV of GREG in Figure 5.3. The RRMSE
of PR (dotted line) stayed well below 10%, while the CV of GREG (circled
line) varied between 10% and 90%. The CV of GREG was below 10% in domains
with large samples.
Figure 5.4 highlights the differences in RRMSE among PR, YR, and
EBLUP. In domains that did not cut across the size strata, EBLUP performed
better than YR, but PR performed better than YR and EBLUP in most
100% (319) 98.1% (344) 97.9% (255) 90.8% (277) 23.9% (240)
based MSEs were calculated assuming negligible sampling rates for rates un-
6. Summary
We studied the performance of several small area estimators in terms of design
bias and MSE. The standard pseudo-EBLUP estimators were adapted for this
study to the estimation of a weighted mean under a unit level linear mixed
model with heteroscedastic error variances. Rubin-Bleuer et al. (2007) developed
these extensions. In this paper, we developed the approximation to their
model-based MSE and MSE estimators for negligible and non-negligible sampling
rates.
Our recommendation for the Survey of Employment, Payrolls and Hours (SEPH) is
based on the premise that the synthetic population represents the SEPH
population reasonably well with respect to the models used.
Appendix
We visually examined diagnostic plots of the standardized transformed residuals
(STDR) (Estevao et al. 2015) in order to back up the heteroscedastic
model variance structure. We fit the homoscedastic model with $\mathrm{Var}(y_{ik}) \propto \sigma^2$,
and the heteroscedastic model with $\mathrm{Var}(y_{ik}) \propto \sigma^2 / C_{ik}$, to each of the finite
populations. The STDRs decrease in variability as $C_{ik}$ increases under
constant error variance. See, for example, Figures A.1 and A.2, respectively.
Under the heteroscedastic model, the respective plots of the PR, YR, and
EBLUP residuals show that the linear model is a reasonable fit: Figure A.3
shows the standardized transformed residuals from fitting the December
population of Industry 4 under the heteroscedastic model. The corresponding
QQ plot of the STDRs in Figure A.4 shows that the model errors deviate from
normality. Similar results were obtained when we fit the model to some of the
individual samples from each of the four populations.
References
Battese, G. E., R. M. Harter, and W. A. Fuller (1988), “An error-components model for prediction
of county crop areas using survey and satellite data,” Journal of the American Statistical
Association, 83, 28–36.
Beaucage, Y., S. Godbout, and Y. Morin (2005), “Survey of Employment, Payrolls and Hours:
New Modelling Perspectives,” Internal document, Statistics Canada.
Estevao, V., M. A. Hidiroglou, Y. You, and S. Rubin-Bleuer (2015), “Methodology Software
Library Small–Area Estimation Unit Level Model with EBLUP and Pseudo EBLUP Estimation
Methodology Specifications,” International Cooperation and Corporate Statistical Methods
Division, Internal document, Statistics Canada.
Fabrizi, E., N. Salvati, M. Pratesi, and N. Tzavidis (2014), “Outlier robust model-assisted small
area estimation,” Biometrical Journal, 56, 157–175.
Jiang, J., and T. Nguyen (2012), “Small area estimation via heteroscedastic nested-error
regression,” Canadian Journal of Statistics, 40, 588–603.
Prasad, N. G. N., and J. N. K. Rao (1999), “On robust small area estimation using a simple random
effects model,” Survey Methodology, 25, 67–72.
Rao, J. N. K. (2003), Small Area Estimation, New Jersey: John Wiley and Sons, Inc.
Rubin-Bleuer, S., S. Godbout, and Y. Morin (2007a), “Evaluation of small domain estimators for
the Survey of Employment, Payroll and Hours,” Proceedings of the Survey Methods Section of
the Statistical Society of Canada, June 10-14, 2007. Available at
http://www.ssc.ca/sites/ssc/files/survey/documents/SSC2007_S_RubinBleuer.pdf.
———— (2007b), “Evaluation of small domain estimators for the Survey of Employment, Payroll
and Hours,” Long abstract and presentation, Small Area Estimation (SAE 2007) in Pisa, Italy,
International Statistical Institute Satellite Conference. Available at
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.503.2985&rep=rep1&type=pdf.
Stukel, D., and J. N. K. Rao (1997), “Estimation of Regression Models with Nested error Structure