Presentation PSD 2016

Spatial smoothing and
statistical disclosure control

Edwin de Jonge and Peter-Paul de Wolf
Contents
Introduction
Smoothing methods
Examples
Discussion
Introduction
Geographical output of NSIs:

— Traditionally using tables (counts, magnitudes, …)
— Using administrative regions, defined by e.g.
    — property (fire brigade area)
    — size of ground area (grid)
    — number of units (NUTS)
— Aggregated data plotted on map
Introduction
Aggregated data at the level of administrative regions may lead to

— Differencing problems
— Difficulties finding spatial patterns
— ‘Pixelized’ maps
Introduction
Alternative: publish microdata with geo coordinates?
How detailed should geo variables in microdata be?
— Detailed enough to reveal spatial patterns?
— Coarse enough to ensure privacy?
Masking of geo coordinates is often used in health data
Introduction
Policy makers often want the distribution of a property over a region

— How are social benefits distributed over the neighborhoods?
— Which areas are at risk for social problems?
— Which neighborhoods need more attention?
Traditionally
— Produce safe tables and plot on a map
— Perturb microdata and plot on a map
Estimate spatial distribution directly
Plot distribution on a map
Some definitions
Attacker scenario we consider:
— Attacker knows which units are linked to a particular location and can identify them
Type of variable we consider:
— Variable with at least one sensitive category
What we estimate:
— Spatial distribution of relative frequencies
For ease of presentation:
— Dichotomous variable with one sensitive category
Some definitions
Sensitive location: location where sensitive information can be derived about individual units linked to that location
‘k-anonymity’: at least 𝑘 observations related to an identifiable group
— Rare sensitive location: location with fewer than 𝑘 nearby neighbors
Group disclosure: at most a fraction 𝑓 of an identifiable group associated with the sensitive category
— Group sensitive location: location with relative frequency larger than 𝑓
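The two definitions above can be made concrete with a small sketch. It is only an illustration under stated assumptions: coordinates are planar and in meters, ‘nearby’ is taken to mean within a hypothetical `radius` (the slides do not fix one), a unit counts as its own neighbor, and all function and parameter names are invented for this example.

```python
import math

def classify_location(x, y, units, k, f, radius):
    """Classify (x, y) as 'rare', 'group' or 'ok'.

    units  : list of (x, y, sensitive) tuples, sensitive being True/False
    k      : minimum number of nearby units (k-anonymity threshold)
    f      : maximum tolerated relative frequency of the sensitive category
    radius : assumed meaning of 'nearby', in the same unit as the coordinates
    """
    nearby = [s for (ux, uy, s) in units
              if math.hypot(ux - x, uy - y) <= radius]
    if len(nearby) < k:
        return "rare"                      # fewer than k nearby neighbors
    share = sum(1 for s in nearby if s) / len(nearby)
    return "group" if share > f else "ok"  # relative frequency above f


units = [(0, 0, True), (1, 0, True), (0, 1, False), (100, 100, True)]
print(classify_location(0, 0, units, k=3, f=0.5, radius=5))
print(classify_location(100, 100, units, k=3, f=0.5, radius=5))
```

In this toy data the first location has three nearby units, two of them sensitive, so it exceeds 𝑓 = 0.5; the second has only one unit within the radius.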
Approach
1. Specify 𝑘 that defines rare sensitive locations
2. Specify 𝑓 that defines group sensitive locations
3. Estimate spatial density 𝑓̂(𝑥, 𝑦) of relative frequency
4. Partition [0, 𝑓] into at most 5 levels
    — Assign group sensitive locations to top level
    — Assign rare sensitive locations to bottom level
5. Repeat 3 and 4 with different parameters until ‘satisfied’
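Step 4 can be sketched as a small mapping from an estimated relative frequency to a level. This is a sketch, not the authors' script: it assumes equal-width levels on [0, 𝑓] (the slides do not say how the interval is split), and the name `assign_level` and its parameters are hypothetical.

```python
def assign_level(freq, f, is_rare=False, n_levels=5):
    """Map an estimated relative frequency to a level index 0 .. n_levels-1.

    freq     : estimated relative frequency at a location
    f        : threshold that defines group sensitive locations
    is_rare  : True if the location is a rare sensitive location
    n_levels : number of levels (the slides use at most 5)
    """
    if is_rare:
        return 0                    # rare sensitive location: bottom level
    if freq > f:
        return n_levels - 1         # group sensitive location: top level
    width = f / n_levels            # assumed equal-width partition of [0, f]
    return min(int(freq / width), n_levels - 1)


for value in (0.05, 0.25, 0.9):
    print(value, assign_level(value, f=0.5))
```

Top-coding everything above 𝑓 means the map never shows how far above the threshold a location lies, which is the protective part of the step.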
Smoothing
Smoothing inherently protects units by distributing mass over a
larger region
We will discuss two well-known smoothing estimators:

— Kernel density estimation (KDE)
— 𝑘 nearest neighbor estimation (kNN)
Kernel Density Estimator (KDE)
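First attempt: fixed bandwidth

$$\hat{f}(x, y) = \frac{\hat{v}(x, y)}{\hat{d}(x, y)}$$

with

$$\hat{v}(x, y) = \frac{1}{n h^2} \sum_{\{i \in \mathcal{P} \mid v_i = c\}} K\left(\frac{x - x_i}{h}, \frac{y - y_i}{h}\right)$$

the spatial KDE of sensitive category $c$, where $n = |\{i \in \mathcal{P} \mid v_i = c\}|$, and

$$\hat{d}(x, y) = \frac{1}{N h^2} \sum_{i \in \mathcal{P}} K\left(\frac{x - x_i}{h}, \frac{y - y_i}{h}\right)$$

the spatial KDE of the (target) population, where $N = |\mathcal{P}|$.

Whenever $\hat{d}(x, y) = 0$, $\hat{f}(x, y)$ will be undefined and displayed ‘transparently’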
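The fixed-bandwidth estimator above can be sketched in a few lines of plain Python. The slides do not name the kernel 𝐾, so a standard bivariate Gaussian is assumed here; the function names and the `(x, y, sensitive)` tuple layout are likewise illustrative choices, not the authors' implementation.

```python
import math

def gaussian_kernel(u, v):
    """Standard bivariate Gaussian kernel (an assumed choice of K)."""
    return math.exp(-0.5 * (u * u + v * v)) / (2.0 * math.pi)

def kde(x, y, points, h):
    """Fixed-bandwidth KDE at (x, y) over a list of (x_i, y_i) points."""
    if not points:
        return 0.0
    total = sum(gaussian_kernel((x - px) / h, (y - py) / h)
                for px, py in points)
    return total / (len(points) * h * h)

def relative_frequency(x, y, units, h):
    """Ratio v_hat / d_hat; None where the population density is zero."""
    population = [(ux, uy) for ux, uy, _ in units]
    sensitive = [(ux, uy) for ux, uy, s in units if s]
    d_hat = kde(x, y, population, h)
    if d_hat == 0.0:
        return None                 # undefined: to be drawn transparently
    return kde(x, y, sensitive, h) / d_hat


units = [(0.0, 0.0, True), (30.0, 0.0, False), (0.0, 40.0, True)]
print(relative_frequency(10.0, 10.0, units, h=50.0))
print(relative_frequency(1e6, 1e6, units, h=50.0))   # far from all units
```

With a Gaussian kernel the denominator only reaches zero through floating-point underflow far from every unit; a compactly supported kernel would give exact zeros outside its support. Because both densities are normalized separately (by 𝑛 and by 𝑁), the ratio as written equals the kernel-weighted local share multiplied by 𝑁/𝑛.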
k nearest neighbor estimator (kNN)
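Fraction of 𝑘 nearest neighbors belonging to the sensitive category:

$$\hat{f}(x, y) = \frac{1}{k} \sum_{\{i \in \mathcal{P} \mid i \in \mathrm{knn}(x, y)\}} \mathbf{1}_c(v_i)$$

where

$$\mathbf{1}_c(v_i) = \begin{cases} 1 & \text{if } v_i \text{ scores on } c \\ 0 & \text{otherwise} \end{cases}$$

and $\mathrm{knn}(x, y)$ is the set of $k$ observations nearest to $(x, y)$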
Additional settings
kNN related
— Restrict the search area for 𝑘 nearest neighbors to observations at a distance of no more than 𝑀 meters
    — Otherwise areas with no population could still show positive relative frequencies
    — Locations with fewer than 𝑘 neighbors within 𝑀 meters get an undefined fraction assigned
General
— Partition [0, 𝑓] into at most 5 levels
    — Generally accepted convention in data visualization
    — Easier to ‘top-code’
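The distance-restricted kNN estimator described above can be sketched as follows. It is a brute-force illustration that sorts all units by distance, which is fine for small examples but not for a full population register; the name `knn_fraction` and the `(x, y, sensitive)` tuple layout are assumptions made for this example.

```python
import math

def knn_fraction(x, y, units, k, max_dist):
    """Share of the k nearest units that fall in the sensitive category.

    Only units within max_dist (the M of the slides) are considered;
    returns None when fewer than k such units exist.
    """
    by_distance = sorted(
        (math.hypot(ux - x, uy - y), s) for ux, uy, s in units
    )
    within = [(d, s) for d, s in by_distance if d <= max_dist]
    if len(within) < k:
        return None                 # undefined fraction
    return sum(1 for _, s in within[:k] if s) / k


units = [(0, 0, True), (10, 0, False), (20, 0, True), (300, 0, True)]
print(knn_fraction(0, 0, units, k=2, max_dist=250))   # 0.5
print(knn_fraction(0, 0, units, k=4, max_dist=250))   # None
```

In the second call only three units lie within 250 m, so the location gets no value instead of borrowing a unit from 300 m away.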
Examples
— Dataset on Youth Care (locations slightly randomized)
— Relative to total youth population
— Five levels
— KDE with fixed bandwidth
— kNN with restricted maximum distance
— Plots produced with our own script
[Map: KDE (h = 50 m)]
[Map: KDE (h = 100 m)]
[Map: kNN (k = 20, distance ≤ 250 m)]
Discussion
Utility related
— ‘Oversmoothing’ may hide or remove spatial patterns
— How to objectively measure utility of such maps?
Disclosure risk related
— How to measure disclosure risk of a distribution of relative frequencies?
— Dichotomous variables with categories 𝑐₁ and 𝑐₂
    — Disseminating both maps
    — Large 𝑓̂(𝑥, 𝑦) for one category connected with small 𝑓̂(𝑥, 𝑦) for the other
    — Protect ‘simultaneously’?
Discussion
Anomalies
— Probability mass may leak into ‘empty’ areas (rivers, lakes, woods, …)
— False sensitive locations: leaking additional mass to non-sensitive locations
— Dislocated modes: two adjacent modes may be blended into a single mode in between
— kNN shows artificial boundaries
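The ‘dislocated modes’ anomaly is easy to reproduce numerically. The sketch below is a one-dimensional simplification with a Gaussian kernel (both are assumptions made for readability, not the setting of the examples in the slides): two units 100 m apart are smoothed with a small and a large bandwidth, and the density halfway between them is compared with the density at one of the units.

```python
import math

def kde_1d(x, points, h):
    """Fixed-bandwidth Gaussian KDE on a line (1-D for readability)."""
    norm = len(points) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in points) / norm

clusters = [0.0, 100.0]   # two units 100 m apart (illustrative data)

def midpoint_is_peak(h):
    """True when the density halfway between the units exceeds that at a unit."""
    return kde_1d(50.0, clusters, h) > kde_1d(0.0, clusters, h)


print(midpoint_is_peak(20.0))    # False: two separate modes remain
print(midpoint_is_peak(100.0))   # True: the modes merge in between
```

For two equally weighted Gaussian kernels the sum becomes unimodal once the distance between the points is at most 2ℎ, so with ℎ = 100 m the map would show its peak at a place where no unit lives.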
Discussion
Future work?
— Variable bandwidth
— Automatic bandwidth selection
— Boundary kernels
— Disclosure risk measures for spatial distribution plots
— Utility measures for spatial distribution plots
Suggestions are welcome!
Thank you
