Professional Documents
Culture Documents
and Computing
Nikos Tzavidis
University of Southampton
Supported by the National Centre for Research Methods (NCRM) & the UK Economic
Part 2
• Introduction to MCMC, STAT-JR & ebooks
• Unit-level SAE models in STAT-JR
• Interoperability with R
• Computer practical
Introduction
Introduction
Na h-Eileanan Siar
Foula
Fair Isle
Moray
Highland
Greater London
Aberdeenshire Enfield
Barnet
Harrow I
Aberdeen City Haringey Redbridge
Havering
Brent Camden G H Barking and
Dagenham
Hillingdon Ealing J Newham
E F
C D
Southwark
Greenwich
w
nslo
Lambet
Hou Bexley
A Wandsw
Angus ort Lewisham
h
h
Merton
Perth and Kinross B
A Richmond upon Thames Bromley F City of London
Sutton Croydon
B Kingston upon Thames G Islington
1
C Hammersmith and Fulham H Hackney
D Kensington and Chelsea I Waltham Forest
E Westminster J Tower Hamlets
Fife
Stirling
1 Dundee City
2
3 West Dunbartonshire
Argyll and Bute
2 Clackmannanshire
Unitary authorities
4 East Dunbartonshire
5 North Lanarkshire
3
4
Falkirk
City of
Local authority districts
East Lothian
6 Renfrewshire Inverclyde
West Edinburgh (including metropolitan districts & London boroughs)
6 8 5 Lothian
7 East Renfrewshire
Midlothian
8 Glasgow City 7
Council areas (Scotland)
North Ayrshire South Local government districts (Northern Ireland)
Lanarkshire 9 Newcastle upon Tyne
Scottish Borders 10 North Tyneside Counties (see note below), unitary authorities,
East 11 Stockton-on-Tees council areas or local government districts (county names
Ayrshire 12 Middlesbrough only shown in bold text)
Boundaries shown are effective as follows: non-metropolitan counties,
South Ayrshire Greater London Authority, unitary authorities, non-metropolitan districts,
Northumberland
London boroughs, metropolitan districts, council areas and local government
Causeway Coast
and Glens Dumfries and Galloway districts at 31 December 2017.
10
Counties shown include metropolitan counties for completeness of
9 Tyne coverage. Metropolitan county councils were abolished in 1986, but the
Derry City Mid and Carlisle South Tyneside
and Strabane East Antrim Gateshead and county areas are still recognised for statistical purposes.
Wear Sunderland
Please visit the ONS Geography web pages for the latest information:
127 https://www.ons.gov.uk/methodology/geography
Mid Ulster Allerdale County Durham
Belfast Hartlepool Please visit the Open Geography portal to browse or download available
Eden
Fermanagh Ards and
Lisburn and
North Down
Cumbria Darlington 11
Redcar and boundaries or other geographical products:
and Omagh Castlereagh 12 Cleveland
http://geoportal.statistics.gov.uk/
Armagh City, Sca
Copeland rbo
Banbridge and Craigavon rou Produced by ONS Geography
n
g
ow
Ha
d
D
an South Lakeland
h
m
e
urn
ble
Newry, Mo
to
13 Preston 24 Liverpool Ryedale
n
14 South Ribble 25 Knowsley Barrow-in North Yorkshire
15 Blackburn with Darwen 26 St. Helens -Furness
Craven
Lancaster Harrogate
127 Antrim and Newtownabbey 16 Hyndburn 27 Halton
17 Rossendale 28 Warrington Ribble York East Riding
18 Burnley 29 Trafford Wyre Valley of Yorkshire
19 Pendle 30 Salford Lancashire 19
Blackpool Bradford Leeds 34 City of Kingston upon Hull
20 Bolton 31 Manchester e 13 Selby
Fyld 18 West Yorkshire 34 35 Lincoln
16
21 Bury 32 Tameside 14 Calderdale
15 17
22 Rochdale 33 Stockport Chorley Wakefield th
21 22 Kirklees Nor hire
23 West Lancashire 23 20 Greater olns North East
Oldham Barnsley Doncaster Linc
Lincolnshire
Wig
Sefton an Manchester South Yorkshire
25 26
30
31 32 Ro
the
ide
24 29 Sh effi rha
Wirral s 28 West
sey
33 High Peak el d m
Isle of Mer 27
Bassetlaw
Lindsey 67 Northampton
Anglesey
37 East Lindsey 68 Wellingborough
Flin
69 East Northamptonshire
st
tsh
Ea
Conwy Dales
ire
39
ire
esh
36 North East Derbyshire 52 Cannock Chase and 70 Cambridge
Staffordshire
Ch Derbyshire 40 Sherwood North Kesteven
hsh
msh
37 Chesterfield 53 Wolverhampton Moorlands 71 Norwich
Amber
ire
Wr
xh 44
38 Bolsover 54 Dudley am 49 Valley 41
ha
48
e
43 Boston
Sou
42
g
39 Mansfield 55 Sandwell
cliffe
in
Gwynedd Derby North Norfolk
th
sh No
tt
40 Ashfield 56 Tamworth
K
Stafford 45 Ru
es
46 en South
41 Broxtowe 57 North Warwickshire
te
Staffordshire Melton Broa
v
Charnwo Holland
42 Erewash 58 Nuneaton and Bedworth 47 od King's Lynn dla Great
estersh nd
Leic
50 52
43 Nottingham 59 Coventry 51 Lichfield ire Rutland
and West Norfolk 71
Yarmouth
Shropshire 62 Peterborough Norfolk
44 Gedling 60 Hinckley and Bosworth 56 60
53 Walsall 61 63 gh
Breckland South
Fenland
45 East Staffordshire 61 Blaby
55 West 57 ou
bor Norfolk
46 South Derbyshire 62 Leicester 54 Midlands 58 Har Corby
69
Birmingham Kettering Waveney
47 North West Leicestershire 63 Oadby and Wigston Camb Ea Forest
Wyre 59 ire st
nsh
Solihull Rugby Huntingdonshire ridge Heath
48 Newcastle-under-Lyme 64 Worcester Forest
pto
Daventry shire
Wo 65
rce
y
Warwick Cambridgeshire
Powys
am 68
r
49 Stoke-on-Trent 65 Bromsgrove Mid Suffolk
bu
st
rth
66
Ceredigion ers Warwickshire
ds
50 Telford and Wrekin 66 Redditch 67 Suffolk
hi
Ma
un
e 70 Suffolk
64 r Bedford
dm
nshir E
Stratford- South Coastal
lver
51 South Staffordshire
St
N
County of
on-Avon pto Cambridgeshire
e
n Hills
Herefordshire
South am
re
Wychavon Milton 86
rth
i
Babergh
sh
No Keynes
dfo
rd
bu ry Be 89 er
tral
Braintree
Pembrokeshire kes Cherwell
Bu Aylesbury Cen e
Uttlesford
Tew hir
t
Carmarthenshire 92
es
c Vale Tendring
ean
77 88 s Essex
rd
lch
West
fo
78 93
fD
Co
Mo
ki
Gloucestershire Oxfordshire
nm res
to
Dacorum 90
ng
Che
rt
95
75 Fo 87 91
t
He
C
s
73 Cotswold
Stroud Fo
am
Neath Port
re
lm
Oxfordshire 94
ou
g
combe Eppin
Maldon
sfor
76
th
Talbot 98
hi
72 96
shi
99
lte
South
s
74 Vale of 97
hire
d
Rochford
rn
Swansea White Horse Oxfordshire 100
re
Wy
Newport 112 101 Southend-on-Sea
79 Swindon Greater London
Bridgend Cardiff 113 Thurrock
Vale of
105 (See Inset) Medway
81 102 10 125 126
Glamorgan North West Berkshire
103 104 10 8
Somerset
6 Thanet
ry
80 109 107 Swale
s
120
rbu
ak
72 Rhondda Cynon Taf Wiltshire Basingstoke 110
Hart
no
Surrey 121 124
e
Dover
111 Maidstone Ca
nt
ve
73 Merthyr Tydfil and Deane Guildford 122
Mole
Se
74 Caerphilly Mendip Kent
o
or
Tunbr
Valley
West Sedgem idge
Test Hampshire East
Waverley We Ashford
75 Blaenau Gwent North Somerset 123 lls
Somerset Valley Hampshire
76 Torfaen Devon Mid
t
Taunton erse Winchester
West Sussex Sussex
Wealden Shepway
77 Cheltenham Deane om
ut hS 114 Chichester Horsham East Sussex Rother 86 Ipswich
78 Gloucester Torridge So North 115 Lewes 87 Oxford
79 South Gloucestershire Mid Devon Dorset East Hastings
Dorset New Forest 116 117 Arun 118119 88 Luton
80 Bath and North East Somerset Devon Dorset
East Devon West Dorset Eastbourne 89 North Hertfordshire
81 City of Bristol
West 83 84 85 Brighton and Hove 90 St Albans
Isle of
Devon Purbeck Wight Portsmouth 91 Welwyn Hatfield
Teignbridge Gosport 92 Stevenage
Exeter 82
93 East Hertfordshire
Torbay 94 Broxbourne
Cornwall South
Hams 120 Epsom and Ewell 95 Harlow
102 Reading 108 Spelthorne 114 Eastleigh 121 Reigate and Banstead 96 Watford
Plymouth 103 Wokingham 109 Surrey Heath 115 Southampton 122 Tandridge 97 Three Rivers
82 Weymouth and Portland 104 Bracknell Forest 110 Woking 116 Fareham 123 Crawley 98 Hertsmere
83 Poole 105 Windsor and Maidenhead 111 Rushmoor 117 Havant 124 Tonbridge and Malling 99 Brentwood
Isles of Scilly 84 Bournemouth 106 Runnymede 112 South Bucks 118 Worthing 125 Dartford 100 Basildon
θ̂iSynthetic = θ̂L
• Auxiliary information
θ̂iSynthetic = x̄ T
i β̂,
ni
XX ni
XX
−1
β̂ = ( xij xT
ij /π ij ) ( xij yij /πij )
i j=1 i j=1
3. Composite estimator
\ = 0.6 · q̂0.5 ,
ARPT
where q̂0.5 is the median.
P
\ )wj
I (yj < ARPT
j
[=
HCR Pn · 100
j=1 wj
For a given sample, let q̂0.2 and q̂0.8 denote the weighted 20% and 80%
quantiles, respectively. Using index sets I≤q̂0.2 and I>q̂0.8 , the quintile share
ratio is estimated by P
wj yj
[ = Pj∈I>q̂0.8
QSR .
j∈I≤q̂ wj yj
0.2
Value by domain:
stratum value
1 Burgenland 5.073746
2 Carinthia 3.590037
3 Lower Austria 3.845026
4 Salzburg 3.829411
5 Styria 3.472333
6 Tyrol 3.628731
7 Upper Austria 3.675467
8 Vienna 4.705347
9 Vorarlberg 4.525096
Reference: Alfons and Templ (2013)
Small Area Estimation 20 / 82 NCRM Training - University of Bristol
Introduction to SAE, Direct & Model-assisted Estimation
Variance estimation
Measures of uncertainty
• Variance,
• Coefficient of Variation
Variance estimation
Resampling methods
• Jackknife
• Bootstrap
Analytic methods
• Taylor linerisation
Value by domain:
stratum value
1 Burgenland 19.53984
2 Carinthia 13.08627
3 Lower Austria 13.84362
...
6 Tyrol 15.30819
7 Upper Austria 10.88977
8 Vienna 17.23468
9 Vorarlberg 16.53731
Reference: Alfons and Templ (2013)
Small Area Estimation 24 / 82 NCRM Training - University of Bristol
Introduction to SAE, Direct & Model-assisted Estimation
10000 0.5
8000 0.4
6000 0.3
0.2
4000 0.1
2000 0.0
Recap
Unit-level models
• Use unit-level data (e.g. from surveys) for model fit
• Area level covariates (predictor variables)
• Sufficient for estimating small area averages/proportions
• Access to unit-level data → possible confidentiality issues
Area-level models
• Use only area-level data for model fit and SAE
• Model specified at the area-level
• Data access possibly less complex than access to unit-level data
In this session we focus on the estimation of small area averages
Key Concept:
Include random area-specific effects to account for between area variation/
unexplained variability between the small areas.
yij = xT
ij β + ui + eij , j = 1, ..., ni , i = 1, ..., m
where
−1 −1
β̂ = (XT V̂ X)−1 XT V̂ y
−1
û = σ̂u2 Z T V̂ (y − Xβ̂)
V̂ = σ̂u2 Z Z T + σ̂e2 I n
An MSE estimator of the small area estimator of the mean under BHF is
(see Prasad & Rao, 1990)
Sampling model
θ̂idirect = θi + ei
Linking model
θ̂idirect = xi β + ui + ei , i = 1, . . . , m,
where ui ∼ N(0, σu2 ) and ei ∼ N(0, σe2i ), with σe2i assumed known
θ̂iFH = x T
i β̂ + ûi
= γi θ̂idirect + (1 − γi )x T
i β̂
An MSE estimator of the small area estimator of the mean under BHF is
(see Prasad & Rao, 1990)
Case study
0.25
1
EBLUP
SE
0.15
0
0.05
−1
−1 0 1 2
Direct F−H
Direct
WI
2
−1
Non-linear indicators
Data Requirements
• Estimation requires access to unit-level population covariates (e.g.
Census microdata)
• Data access is challenging
• Use of area-level models is possible
• Here we focus on unit-level models
Recent methodologies
yij = xT
ij β + ui + eij , j = 1, . . . , ni , i = 1, . . . , D,
σ̂u2
1 Use the sample data to estimate β̂, σ̂u2 , σ̂e2 , ûi and γ̂i = σ̂e2
.
σ̂u2 + ni
2 For l = 1, ..., L
• Compute E (yr |ys ) under the assumption of normal errors
• Generate eij∗ ∼ N(0, σ̂e2 ) and ui∗ ∼ N(0, σ̂u2 · (1 − γ̂i )), simulate a
pseudo-population
yij ∗(l) = xT ∗ ∗
ij β̂ + ûi + ui + eij
yij∗ = xT ∗ ∗
ij β̂ + ui + eij
B
X
[ (θ̂i ) = B −1
MSE (θ̂i∗b − θi∗b )2
b=1
Out-of-sample domains: 0
In-sample domains: 9
Sample sizes:
Units in sample: 503
Units in population: 25000
Min. 1st Qu. Median Mean 3rd Qu. Max.
Sample_domains 16 26 43 55.9 94 101
Pop_domains 799 1671 1889 2778 4071 5857
Explanatory measures:
Marginal_R2 Conditional_R2
0.5198029 0.5198029
Residual diagnostics:
Skewness Kurtosis Shapiro_W Shapiro_p
Error 2.17646 12.5925 0.8551573 4.0933e-21
Random_effect 0.64311 2.6048 0.8870226 1.8589e-01
ICC: 2.610126e-08
• Influence diagnostics
2.5 0.001
0.000
0.0
−0.001
−2.5
−2 0 2 −0.002−0.001 0.000 0.001 0.002
Theoretical quantiles Theoretical quantiles
Model adaptations
• Use an EBP formulation under an alternative distribution (Graf et al.,
2015) - Model under generalised Beta distribution of the second kind
• Use robust methods as an alternative to transformations (Chambers
and Tzavidis, 2006; Ghosh, 2008; Sinha and Rao, 2009; Chambers
et al., 2014; Schmid et al., 2016)
• Use non-parametric models (Opsomer et al., 2008; Ugarte et al.,
2009)
• Elaborate the random effects structure e.g. include spatial structures
(Pratesi and Salvati, 2008; Schmid et al., 2016)
• Use of transformations
• Power transformations
- Box-Cox
- Exponential
- Sign power
- Modulus
- Dual power
- Convex-to-concave
• Multi-parameter transformations
- Johnson
- Sinh-arcsinh
Scaled transformations
- Skewness minimization
- Divergence minimization
- ML/REML
1 For b = 1, ..., B
• Using the already estimated β̂, σ̂u2 , σ̂e2 , λ̂ from the transformed data
T (yij ) = ỹij , simulate a bootstrap superpopulation
ỹij∗ (b) = xT ∗ ∗
ij β̂ + ui + eij
• Transform ỹij∗ (b) to original scale resulting in yij ∗(b)
• For each b population compute the population value θi∗b
• Extract the bootstrap sample in yij ∗(b) and use the EBP method
• Estimate λ with the bootstrap sample
• Obtain θ̂i∗b
PB
[ (θ̂i ) = B −1
MSE ∗b − θi∗b )2
2 b=1 (θ̂i
Using emdi
Currently function ebp() includes a logarithmic or Box-Cox transformation
# EBP estimation function under a Box-Cox
transformation
ebp_au <- ebp(fixed = eqIncome ~ gender + eqsize +
py010n + py050n + py090n +
py100n + py110n + py120n +
py130n + hy040n + hy050n +
hy070n + hy090n + hy145n,
pop_data = eusilcP_HH,
pop_domains = "region",
smp_data = eusilcS_HH,
smp_domains = "region",
pov_line = 0.6*median(eusilcS_HH$eqIncome
),transformation = "box.cox",L=50,
MSE = T,B = 50)
Transformation:
Transformation Method Optimal_lambda Shift_parameter
box.cox reml 0.4317972 0
Explanatory measures:
Marginal_R2 Conditional_R2
0.4543301 0.4543301
Residual diagnostics:
Skewness Kurtosis Shapiro_W Shapiro_p
Error 0.76051 6.3646 0.95643 4.9497e-11
Random_effect 0.58501 2.5533 0.95227 7.1501e-01
Finding λ̂
Graphical representation of the optimal λ̂ is made using the function plot
Box−Cox − REML
−5400
Log−likelihood
−5700
−6000
−6300
0 1 2
λ
• Model-based evaluation
- Uses synthetic data generated under a model
- Sampling is performed repeatedly from the population generated in
each Monte-Carlo round
- Useful for evaluating performance and sensitivity of new methods under
different assumptions
• Design-based evaluation
- Uses frame data (e.g. census data) or synthetic data (not generated
under a model) that preserve the survey characteristics
- Sampling is performed repeatedly by keeping the population fixed
- Useful for comparing competing methods in more realistic settings
Absolute bias:
R
1 X
Biasi = θ̂i,r − θi,r
R r =1
Model-based evaluation
Population data: is generated for m = 50 areas with N = 200 via
yij = 4500 − 400xij + ui + eij
0.50
0.4
0.7
0.25
λ
λ
0.2
0.6
0.00
0.0
0.5 −0.25
0.08
0.04
RMSE
0.06
Bias
0.02
0.04
0.00
0.02
Non Log Log−S Box−Cox Convex−C Non Log Log−S Box−Cox Convex−C
Gini Gini
0.05 0.07
0.04
0.06
0.03
RMSE
0.05
Bias
0.02
0.04
0.01
0.03
0.00
Non Log Log−S Box−Cox Convex−C Non Log Log−S Box−Cox Convex−C
−25000
−23200
−30000
Log−likelihood
−30000
−23300 −40000
−35000
−23400 −50000
Residual diagnostics
Non Log Log−Shift
Quantiles of household−level residuals
30 5 5
20
0 0
10
−5 −5
0
−10 −10
−2 0 2 −2 0 2 −2 0 2
Quantiles of municipal−level residuals
0.25 0.25
200
0.00 0.00
0
−0.25 −0.25
Model diagnostics
0.2
Bias
0.0
−0.2
HCR
0.3
RMSE
0.2
0.1
0.0
Log Log−shift Box−Cox Dual
• Non-parametric models
• Benchmarking methods
References I
Alfons, A., S. Kraft, M. Templ, and P. Filzmoser (2011). Simulation of close-to-reality population data for
household surveys with application to eu-silc. Statistical Methods & Applications 20, 383–407.
Alfons, A. and M. Templ (2013). Estimation of social exclusion indicators from complex surveys: The R package
laeken. Journal of Statistical Software 54, 1–25.
Battese, G. E., R. M. Harter, and W. A. Fuller (1988). An error component model for prediction of county crop
areas using survey and satellite data. Journal of the American Statistical Association 83, 28–36.
Chambers, R., H. Chandra, N. Salvati, and N. Tzavidis (2014). Outlier robust small area estimation. Journal of the
Royal Statistical Society: Series B 76, 47–69.
Chambers, R. and N. Tzavidis (2006). M-quantile models for small area estimation. Biometrika 93, 255–268.
CONEVAL (2010). Methodology for multidimensional poverty measurement in Mexico. Report.
Elbers, C., J. Lanjouw, and P. Lanjouw (2003). Micro-level estimation of poverty and inequality. Econometrica 71,
355–364.
Fay, R. E. and R. A. Herriot (1979). Estimation of income for small places: An application of james-stein procedures
to census data. Journal of the American Statistical Association 74, 269–277.
Ghosh, M. (2008). Robust estimation in finite population sampling. In Beyond Parametrics in Interdisciplinary
Research: Festschrift in Honor of Professor Pranab K. Sen, pp. 116–122.
González-Manteiga, W., M. Lombardía, I. Molina, D. Morales, and L. Santamaría (2008). Bootstrap mean squared
error of a small-area eblup. Journal of Statistical Computation and Simulation 78, 443–462.
Graf, M., J. Marin, and I. Molina (2015). Estimation of poverty indicators in small areas under skewed distributions.
In Proceedings of the 60th World Statistics Congress of the International Statistical Institute, The Hague,
Netherlands.
Kreutzmann, A.-K., S. Pannier, N. Rojas, T. Schmid, N. Tzavidis, and M. Templ (2019). emdi: An r package for
estimating and mapping regional disaggregated indicators. To appear, Journal of Statistical Software.
Molina, I. and Y. Marhuenda (2015). sae: An R package for small area estimation. The R Journal 7, 81–98.
References II
Molina, I. and J. N. K. Rao (2010). Small area estimation of poverty indicators. The Canadian Journal of
Statistics 38, 369–385.
Opsomer, J., G. Claeskens, M. Ranalli, G. Kauermann, and F. Breidt (2008). Nonparametric small area estimation
using penalized spline regression. Journal of the Royal Statistical Society Series B 70, 265–283.
Prasad, N. G. N. and J. N. K. Rao (1990). The estimation of the mean squared error of small area estimators.
Journal of the American Statistical Association 85, 163–171.
Pratesi, M. and N. Salvati (2008). Small area estimation: the eblup estimator based on spatially correlated random
area effects. Statistical Methods & Applications 17, 113–141.
Rojas-Perilla, R., T. Schmid, S. Pannier, and N. Tzavidis (2019). Data-driven transformations in small area
estimation. To appear, Journal of the Royal Statistical Society, Series A.
Schmid, T., N. Tzavidis, R. Münnich, and R. Chambers (2016). Outlier robust small area estimation under spatial
correlation. Scandinavian Journal of Statistics 43, 806–826.
Sinha, S. K. and J. N. K. Rao (2009). Robust small area estimation. The Canadian Journal of Statistics 37,
381–399.
Tzavidis, N., S. Marchetti, and R. Chambers (2010). Robust estimation of small area means and quantiles.
Australian and New Zealand Journal of Statistics 52, 167–186.
Ugarte, M., T. Goicoa, A. Militino, and M. Durban (2009). Spline smoothing in small area trend estimation and
forecasting. Computational Statistics & Data Analysis 53, 3616–3629.