
Spatial smoothing and

statistical disclosure control


Edwin de Jonge and Peter-Paul de Wolf
Contents

Introduction

Smoothing methods

Examples

Discussion

Introduction

Geographical output of NSIs:


— Traditionally using tables (counts, magnitudes, …)
— Using administrative regions, defined by e.g.
    — property (fire brigade area)
    — size of ground area (grid)
    — number of units (NUTS)
— Aggregated data plotted on map

Introduction

Aggregated data at the level of administrative regions may lead to

— Differencing problems
— Difficulties finding spatial patterns
— ‘Pixelized’ maps

Introduction

Alternative: publish microdata with geo coordinates?

How detailed should geo variables in microdata be?

— Detailed enough to reveal spatial patterns?
— Coarse enough to ensure privacy?

Masking geo coordinates is often used in health data

Introduction

Policy makers often want the distribution of a property over a region

— How are social benefits distributed over the neighborhoods?
— Which areas are at risk for social problems?
— Which neighborhoods need more attention?

Traditionally
— Produce safe tables and plot on a map
— Perturb microdata and plot on a map

Estimate spatial distribution directly

Plot distribution on a map

Some definitions

Attacker scenario we consider:
Attacker knows which units are linked to a particular location and
can identify them

Type of variable we consider:
Variable with at least one sensitive category

What we estimate:
Spatial distribution of relative frequencies

For ease of presentation:
Dichotomous variable with one sensitive category

Some definitions

Sensitive location: location where sensitive information can be
derived about individual units linked to that location

‘k-anonymity’: at least 𝑘 observations related to an identifiable
group
Rare sensitive location:
location with fewer than 𝑘 nearby neighbors

Group disclosure: at most a fraction 𝑓 of an identifiable group
associated with the sensitive category
Group sensitive location:
location with relative frequency larger than 𝑓
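
The two definitions above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say what "nearby" means, so a hypothetical `radius` parameter is used, the unit at the location itself is counted among its neighbors, and the relative frequency is taken as the plain share of sensitive units within that radius. The function name and the data layout are also made up for the example.

```python
from math import hypot

def classify_location(x, y, points, k, f, radius):
    """Label a location as 'rare', 'group' or 'ok'.

    points: list of (px, py, is_sensitive) tuples.
    radius: hypothetical definition of 'nearby' (not given in the slides).
    """
    # Units within `radius` of (x, y); assumption: this includes units at (x, y).
    nearby = [s for (px, py, s) in points if hypot(px - x, py - y) <= radius]
    if len(nearby) < k:
        return "rare"            # fewer than k nearby neighbors
    rel_freq = sum(nearby) / len(nearby)
    if rel_freq > f:
        return "group"           # relative frequency larger than f
    return "ok"

points = [(0, 0, True), (1, 0, True), (0, 1, False), (10, 10, True)]
print(classify_location(0, 0, points, k=3, f=0.5, radius=2))
print(classify_location(10, 10, points, k=3, f=0.5, radius=2))
```

Checking the rare case first is a design choice of this sketch: with too few neighbors the relative frequency itself is unreliable.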

Approach

1. Specify 𝑘 that defines rare sensitive locations
2. Specify 𝑓 that defines group sensitive locations
3. Estimate spatial density 𝑓̂(𝑥, 𝑦) of relative frequency
4. Partition [0, 𝑓] into at most 5 levels
    — Assign group sensitive locations to top level
    — Assign rare sensitive locations to bottom level
5. Repeat 3 and 4 with different parameters until ‘satisfied’
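
Step 4 can be illustrated with a small sketch. It assumes equal-width levels on [0, 𝑓], which the slides do not specify, and it lets the "rare" flag take precedence over the frequency; the function name `assign_level` and its arguments are hypothetical.

```python
def assign_level(rel_freq, f, is_rare=False, n_levels=5):
    """Map an estimated relative frequency to a level 0 .. n_levels - 1.

    Assumptions (not from the slides): equal-width levels on [0, f],
    and rare sensitive locations override everything else.
    """
    if is_rare:
        return 0                      # rare sensitive location: bottom level
    if rel_freq > f:
        return n_levels - 1           # group sensitive location: top level
    # Equal-width bins on [0, f]; rel_freq == f also lands in the top level.
    return min(int(rel_freq * n_levels / f), n_levels - 1)

for value in (0.05, 0.25, 0.45, 0.90):
    print(value, assign_level(value, f=0.5))
```

Because every value above 𝑓 collapses into the top level, the map no longer shows by how much a location exceeds the threshold.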

Smoothing

Smoothing inherently protects units by distributing mass over a
larger region

We will discuss two well-known smoothing estimators:

— Kernel density estimation (KDE)
— 𝑘 nearest neighbor estimation (kNN)

Kernel Density Estimator (KDE)

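First attempt: fixed bandwidth

\[
\hat{f}_h(x, y) = \frac{\hat{v}_h(x, y)}{\hat{d}_h(x, y)}
\]
with
\[
\hat{v}_h(x, y) = \frac{1}{n h^2} \sum_{\{i \in \mathcal{P} \mid v_i = c\}} K\!\left(\frac{x - x_i}{h}, \frac{y - y_i}{h}\right)
\]
the spatial KDE of sensitive category $c$, where $n = |\{i \in \mathcal{P} \mid v_i = c\}|$,
and
\[
\hat{d}_h(x, y) = \frac{1}{N h^2} \sum_{i \in \mathcal{P}} K\!\left(\frac{x - x_i}{h}, \frac{y - y_i}{h}\right)
\]
the spatial KDE of the (target) population, where $N = |\mathcal{P}|$.

Whenever $\hat{d}_h(x, y) = 0$, $\hat{f}_h(x, y)$ will be undefined and displayed ‘transparently’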
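
A minimal sketch of the fixed-bandwidth ratio, using only the Python standard library. The slides do not name the kernel 𝐾, so a Gaussian product kernel is assumed here; the function names and the list-of-tuples data layout are likewise illustrative. An undefined value is returned as `None`.

```python
from math import exp, pi

def gaussian_kernel(u, v):
    """Bivariate standard Gaussian kernel (an assumption; K is not specified)."""
    return exp(-0.5 * (u * u + v * v)) / (2 * pi)

def kde(x, y, pts, h):
    """(1 / (m h^2)) * sum of K((x - x_i)/h, (y - y_i)/h) over the m points in pts."""
    if not pts:
        return 0.0
    total = sum(gaussian_kernel((x - px) / h, (y - py) / h) for px, py in pts)
    return total / (len(pts) * h * h)

def density_ratio(x, y, population, sensitive, h):
    """Ratio of the sensitive-category KDE to the population KDE at (x, y)."""
    d = kde(x, y, population, h)
    if d == 0.0:
        return None               # undefined: plot as transparent
    return kde(x, y, sensitive, h) / d

population = [(0.0, 0.0), (10.0, 0.0)]
sensitive = [(0.0, 0.0)]          # the subset of population scoring on c
print(density_ratio(0.0, 0.0, population, sensitive, h=1.0))
print(density_ratio(1000.0, 1000.0, population, sensitive, h=1.0))
```

Note that both densities are normalized separately (by 𝑛 and by 𝑁), so the ratio as written can exceed 1; multiplying it by 𝑛/𝑁 turns it into a local share between 0 and 1. With a Gaussian kernel the denominator only reaches zero through floating-point underflow far from any observation; a kernel with bounded support would give exact zeros.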
k nearest neighbor estimator (kNN)

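Fraction of 𝑘 nearest neighbors belonging to the sensitive
category:
\[
\hat{f}_k(x, y) = \frac{1}{k} \sum_{\{i \in \mathcal{P} \mid i \in \mathrm{knn}(x, y)\}} \mathbb{1}_c(v_i)
\]
where
\[
\mathbb{1}_c(v_i) = \begin{cases} 1 & \text{if } v_i \text{ scores on } c \\ 0 & \text{otherwise} \end{cases}
\]
and
$\mathrm{knn}(x, y)$ is the set of 𝑘 observations nearest to $(x, y)$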
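
The estimator can be written down almost literally. The sketch below uses a brute-force sort over Euclidean distances, which is fine for illustration but not for large populations; the function name and data layout are assumptions.

```python
from math import hypot

def knn_fraction(x, y, points, k):
    """Fraction of the k observations nearest to (x, y) that score on c.

    points: list of (px, py, is_sensitive) tuples.
    """
    if len(points) < k:
        raise ValueError("need at least k observations")
    # Brute force: sort all observations by distance and keep the first k.
    nearest = sorted(points, key=lambda p: hypot(p[0] - x, p[1] - y))[:k]
    return sum(1 for p in nearest if p[2]) / k

points = [(0, 0, True), (1, 0, False), (0, 1, True), (50, 50, False)]
print(knn_fraction(0, 0, points, k=3))
print(knn_fraction(1000, 1000, points, k=4))
```

The second call shows that a point far from every observation still receives a fraction, because the 𝑘 nearest neighbors always exist no matter how distant they are.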

Additional settings

kNN related
— Restrict the search area for 𝑘 nearest neighbors to
  observations with distance no more than 𝑀 meters
    — Otherwise areas with no population could still show positive
      relative frequencies
    — Locations with fewer than 𝑘 nearest neighbors within 𝑀
      meters get an undefined fraction assigned
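
The distance restriction amounts to one extra filter before counting. A sketch under the same assumptions as before (brute-force search, Euclidean distance in meters, hypothetical names), with `None` standing in for the undefined fraction:

```python
from math import hypot

def knn_fraction_within(x, y, points, k, max_dist):
    """kNN fraction using only observations within max_dist (M) of (x, y).

    points: list of (px, py, is_sensitive) tuples.
    Returns None when fewer than k observations lie within max_dist.
    """
    dists = sorted((hypot(px - x, py - y), s) for px, py, s in points)
    in_range = [(d, s) for d, s in dists if d <= max_dist]
    if len(in_range) < k:
        return None               # undefined fraction
    return sum(1 for _, s in in_range[:k] if s) / k

points = [(0, 0, True), (10, 0, False), (0, 10, True), (10, 10, False)]
print(knn_fraction_within(5, 5, points, k=4, max_dist=250))
print(knn_fraction_within(5000, 5000, points, k=4, max_dist=250))
```

The second query point lies in an unpopulated area and now gets no value at all instead of a positive relative frequency.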

General
— Partition [0, 𝑓] into at most 5 levels
    — Generally accepted convention in data visualization
    — Easier to ‘top-code’

Examples

— Dataset on Youth Care (locations slightly randomized)
— Relative to total youth population
— Five levels
— KDE with fixed bandwidth
— kNN with restricted maximum distance
— Plots produced with our own script

KDE (h = 50 m)

KDE (h = 100 m)

kNN (k = 20, distance ≤ 250 m)

Discussion

Utility related
— ‘Oversmoothing’ may hide or remove spatial patterns
— How to objectively measure utility of such maps?

Disclosure risk related

— How to measure disclosure risk of distribution of relative
frequencies?
— Dichotomous variables with categories $c_1$ and $c_2$
    — Disseminating both maps
    — Large $\hat{f}_{c_1}(x, y)$ (e.g. …) connected with small $\hat{f}_{c_2}(x, y)$
    — Protect ‘simultaneously’?

Discussion

Anomalies
— Probability mass may leak into ‘empty’ areas (rivers, lakes,
woods, …)
— False sensitive locations: leaking additional mass to
non-sensitive locations
— Dislocated modes: two adjacent modes may be blended into a
single mode in between
— kNN shows artificial boundaries

Discussion

Future work?
— Variable bandwidth
— Automatic bandwidth selection
— Boundary kernels
— Disclosure risk measures for spatial distribution plots
— Utility measures for spatial distribution plots

Suggestions are welcome!

Thank you
