
# Spatial Data Mining:

## Three Case Studies

www.cs.umn.edu/~shekhar/problems.html

Shashi Shekhar, University of Minnesota
Presented to UCGIS Summer Assembly 2001

Background
NSF workshop on GIS and DM (3/99)
Spatial data [1, 8]: traffic, bird habitats, global climate, logistics, ...
Spatial patterns: outliers, location prediction, associations, sequential associations, trends, ...

Framework
Problem statement: capture special needs
Data exploration: maps, new methods
Try reusing classical methods
from data mining, spatial statistics
If reuse is not possible, invent new methods
Validation, Performance tuning
Case 1: Spatial Outliers
Problem: stations different from neighbors [SIGKDD 2001]
Data - space-time plot, distributions of f(x) and S(x)
Distribution of base attribute:
spatially smooth
frequency distribution over value domain: normal
Classical test - Pr.[item in population] is low
Question: what is the distribution of diff[f(x), neighborhood aggregate of f]?
Insight: this statistic is normally distributed!
Test: |z-score of the statistic| > 2
Performance - spatial join, clustering methods
Spatial outlier detection
[4]

Spatial outlier
A data point that is extreme relative to its neighbors
Given
A spatial graph $G = \{V, E\}$
A neighbor relationship (K neighbors)
An attribute function $f: V \to R$
An aggregation function $f_{aggr}: R^k \to R$
Confidence level threshold $\theta$
Find
$O = \{v_i \mid v_i \in V,\ v_i \text{ is a spatial outlier}\}$

Objective
Correctness: the attribute value of $v_i$ is extreme compared with its neighbors
Computational efficiency
Constraints
Attribute value is normally distributed
Computation cost dominated by I/O operations
Spatial Outlier Detection Test
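1. Choice of Spatial Statistic
$S(x) = f(x) - E_{y \in N(x)}(f(y))$
Theorem: S(x) is normally distributed if f(x) is normally distributed
2. Test for Outlier Detection
$\left| (S(x) - \mu_s) / \sigma_s \right| > \theta$, where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of S(x) and $\theta$ is the confidence level threshold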
Hypothesis
I/O cost determined by clustering efficiency
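
The outlier test above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [4]: the function name `spatial_outliers`, the adjacency-list input, the use of the neighborhood mean as the aggregate, and the toy station values are all assumptions made for the example.

```python
import numpy as np

def spatial_outliers(f, neighbors, theta=2.0):
    """Return indices of nodes whose statistic S(x) has |z-score| > theta.

    f         : attribute value per node (hypothetical input layout)
    neighbors : neighbors[i] lists the node indices adjacent to node i
    theta     : z-score threshold; 2 matches the test quoted on the slide
    """
    f = np.asarray(f, dtype=float)
    # S(x) = f(x) minus the mean of f over the neighbors of x
    s = np.array([f[i] - f[nbrs].mean() for i, nbrs in enumerate(neighbors)])
    # Standardize S(x) and flag the nodes beyond the threshold
    z = np.abs((s - s.mean()) / s.std())
    return np.flatnonzero(z > theta).tolist()

# Toy data (made up): ten stations on a line, each adjacent to the
# previous and the next station.
values = [10, 11, 10, 12, 11, 30, 11, 10, 12, 11]
n = len(values)
adjacency = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]
outliers = spatial_outliers(values, adjacency)
print(outliers)  # [5]
```

Only station 5 is flagged. Its two neighbors also have large values of S(x), because the outlier inflates their neighborhood means, but their z-scores stay below the threshold.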

[Figure: plots of f(x) and S(x) showing a spatial outlier and its neighbors]
Results
1. CCAM achieves higher clustering efficiency (CE)
2. CCAM has lower I/O cost
3. Higher CE leads to lower I/O cost
4. Larger page size improves CE for all methods

[Figure: I/O cost and CE value for Z-order, Cell-Tree, and CCAM]
Case 2: Location Prediction
Citations: SIAM DM Conf. 2001, SIGKDD DMKD 2000
Problem: predict nesting site in marshes
given vegetation, water depth, distance to edge, etc.
Data - maps of nests and attributes
spatially clustered nests, spatially smooth attributes
Classical methods: logistic regression, decision trees, Bayesian classifier
but the independence assumption is violated! They miss auto-correlation!
Spatial auto-regression (SAR), Markov random field Bayesian classifier
Open issue: spatial accuracy vs. classification accuracy
Open issue: performance - SAR learning is slow!
Location Prediction
[6, 7, 8]

Given:
1. Spatial framework $S = \{s_1, \ldots, s_n\}$
2. Explanatory functions: $f_{X_k}: S \to R$
3. A dependent function $f_Y: S \to \{0, 1\}$
4. A family $\mathcal{F}$ of function mappings: $R \times \ldots \times R \to \{0, 1\}$

Find: A function $\hat{f}_Y \in \mathcal{F}$

Objective: maximize classification_accuracy$(\hat{f}_Y, f_Y)$

Constraints:
Spatial autocorrelation exists

[Figure: maps of nest locations, distance to open water, vegetation durability, and water depth]
Evaluation: Changing Model
Linear regression: $y = X\beta + \epsilon$
Spatial regression: $y = \rho W y + X\beta + \epsilon$
Spatial model is better
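
To make the difference between the two models concrete, the sketch below generates data from each one with NumPy. It is an illustration under stated assumptions, not the estimation procedure from [6, 7]: the six-site line layout, the row-normalized weight matrix `W`, and the values of `beta` and `rho` are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Hypothetical neighborhood structure: sites on a line, each adjacent
# to its immediate neighbors; W is row-normalized.
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W /= W.sum(axis=1, keepdims=True)

X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one attribute
beta = np.array([1.0, 2.0])   # assumed coefficients
rho = 0.5                     # assumed strength of spatial dependence
eps = rng.normal(scale=0.1, size=n)

# Linear regression: y = X beta + eps
y_linear = X @ beta + eps

# Spatial regression: y = rho W y + X beta + eps,
# solved as (I - rho W) y = X beta + eps
y_spatial = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)
```

In the spatial model each site's value depends on its neighbors' values through the term `rho * W @ y`. Setting `rho = 0` recovers the linear model.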
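Evaluation: Changing measure
New measure: $ADNP(A, P) = \frac{1}{K} \sum_{k=1}^{K} dist(A_k, A_k.nearest(P))$
where $A_k$ is the k-th of K actual locations and $A_k.nearest(P)$ is the predicted location in P nearest to $A_k$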
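
A small Python sketch of this measure follows. It assumes Euclidean distance and points given as coordinate tuples. The function name `adnp` and the sample coordinates are hypothetical.

```python
import math

def adnp(actual, predicted):
    """Average distance from each actual site to its nearest predicted site."""
    if not actual or not predicted:
        raise ValueError("both point sets must be non-empty")
    total = sum(min(math.dist(a, p) for p in predicted) for a in actual)
    return total / len(actual)

# Made-up example: both predictions miss every actual nest cell,
# but one set of predictions is much closer than the other.
actual_nests = [(0, 0), (4, 0)]
near_miss = [(0, 1), (4, 1)]
far_miss = [(0, 5), (4, 5)]

print(adnp(actual_nests, near_miss))  # 1.0
print(adnp(actual_nests, far_miss))   # 5.0
```

Both predictions would score the same under exact-match classification accuracy, while the distance-based measure separates them.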
Case 3: Spatial Association Rules
Citation: Symp. on Spatial Databases 2001
Problem: Given a set of boolean spatial features
find subsets of co-located features, e.g. (fire, drought, vegetation)
Data - continuous space, partition not natural, no reference feature
Classical data mining approach: association rules
But, Look Ma! No Transactions!!! No support measure!
Approach: Work with continuous data without transactionizing it!
confidence = Pr.[fire at s | drought in N(s) and vegetation in N(s)]
support: cardinality of spatial join of instances of fire, drought, dry vegetation
participation: min. fraction of instances of a feature in join result
new algorithm using spatial joins and apriori_gen filters
Co-location Patterns
[2, 3]

Can you find co-location patterns from the following sample dataset?
Spatial Co-location
A set of features frequently co-located
Given
A set T of K boolean spatial feature types $T = \{f_1, f_2, \ldots, f_K\}$
A set P of N locations $P = \{p_1, \ldots, p_N\}$ in a spatial framework S; each $p_i \in P$ is an instance of some spatial feature in T
A neighbor relation R over locations in S
Find
$T_c$ = subsets of T frequently co-located

Objective
Correctness
Completeness
Efficiency
Constraints
R is symmetric and reflexive
Monotonic prevalence measure
[Figure panels: reference feature centric, window centric, event centric]
Participation index
1. Participation ratio $pr(f_i, c)$ of feature $f_i$ in co-location $c = \{f_1, f_2, \ldots, f_k\}$: fraction of instances of $f_i$ with features $\{f_1, \ldots, f_{i-1}, f_{i+1}, \ldots, f_k\}$ nearby
2. Participation index = $\min_i \{pr(f_i, c)\}$
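
The two definitions above can be illustrated with a brute-force Python sketch. It is not the join-based Co-location Miner from [2, 3]: it enumerates every combination of one instance per feature and keeps those whose points are pairwise within a distance `radius`, which stands in for the neighbor relation R. The function name, the input layout, and the sample coordinates are assumptions.

```python
from itertools import combinations, product
import math

def participation_index(instances, pattern, radius):
    """Return (participation index, per-feature participation ratios).

    instances : dict mapping feature name -> list of (x, y) points
    pattern   : tuple of feature names forming the candidate co-location
    radius    : two points are neighbors if their distance is <= radius
    """
    participating = {f: set() for f in pattern}
    index_ranges = [range(len(instances[f])) for f in pattern]
    for combo in product(*index_ranges):
        points = [instances[f][i] for f, i in zip(pattern, combo)]
        # Keep the combination only if all its points are pairwise neighbors
        if all(math.dist(p, q) <= radius for p, q in combinations(points, 2)):
            for f, i in zip(pattern, combo):
                participating[f].add(i)
    ratios = {f: len(participating[f]) / len(instances[f]) for f in pattern}
    return min(ratios.values()), ratios

# Made-up instances: two fires, three droughts.
data = {
    "fire": [(0, 0), (10, 10)],
    "drought": [(0, 1), (10, 11), (20, 20)],
}
index, ratios = participation_index(data, ("fire", "drought"), radius=1.5)
```

Here every fire has a drought nearby (ratio 1), but only two of the three droughts have a fire nearby (ratio 2/3), so the participation index is 2/3. The brute-force enumeration grows exponentially with pattern size, which is why a practical miner relies on spatial joins and candidate pruning instead.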

Algorithm
Hybrid Co-location Miner
Comparison with association rules

| | Association rules | Co-location rules |
|---|---|---|
| underlying space | discrete sets | continuous space |
| item-types | item-types | events / Boolean spatial features |
| collections | transactions | neighborhoods |
| prevalence measure | support | participation index |
| conditional probability measure | Pr.[ A in T \| B in T ] | Pr.[ A in N(L) \| B at L ] |
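
The conditional probability measure for co-location rules, Pr.[ A in N(L) | B at L ], can be estimated as the share of B instances that have at least one A instance in their neighborhood. The sketch below assumes a distance-threshold neighborhood. The function name and the coordinates are hypothetical.

```python
import math

def colocation_confidence(a_points, b_points, radius):
    """Estimate Pr[A in N(L) | B at L] from point instances.

    Counts the fraction of B instances with at least one A instance
    within `radius` (the assumed neighborhood definition).
    """
    if not b_points:
        raise ValueError("need at least one B instance")
    hits = sum(
        any(math.dist(b, a) <= radius for a in a_points) for b in b_points
    )
    return hits / len(b_points)

# Made-up instances: A = fire, B = drought.
fires = [(0, 0), (10, 10)]
droughts = [(0, 1), (10, 11), (20, 20)]
confidence = colocation_confidence(fires, droughts, radius=1.5)
```

With these points, two of the three droughts have a fire nearby, so the estimate is 2/3. No transactions are needed: the neighborhood of each location plays the role that a transaction plays in classical association rules.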
Conclusions & Future Directions
Spatial domains may not satisfy assumptions of classical methods
data: auto-correlation, continuous geographic space
patterns: global vs. local, e.g. spatial outliers vs. outliers
data exploration: maps and albums

Open Issues
patterns: hot-spots, blobology (shape), spatial trends,
metrics: spatial accuracy (predicted locations), spatial contiguity (clusters)
spatio-temporal datasets
scale and resolution sensitivity of patterns
geo-statistical confidence measure for mined patterns
References
1. S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu and C.T. Lu, Spatial Databases: Accomplishments and Research Needs,
IEEE Transactions on Knowledge and Data Engineering, Jan.-Feb. 1999.

2. S. Shekhar and Y. Huang, Discovering Spatial Co-location Patterns: a Summary of Results, In Proc. of 7th International
Symposium on Spatial and Temporal Databases (SSTD01), July 2001.

3. S. Shekhar, Y. Huang, and H. Xiong, Performance Evaluation of Co-location Miner, the IEEE International Conference on Data
Mining (ICDM01), Nov. 2001. (submitted)

4. S. Shekhar, C.T. Lu, P. Zhang, Detecting Graph-based Spatial Outliers: Algorithms and Applications, the Seventh ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 2001.

5. S. Shekhar, S. Chawla, the book Spatial Database: Concepts, Implementation and Trends. (To be published in 2001)

6. S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest
Locations, Proc. of the 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery
(DMKD 2000), Dallas, TX, May 14, 2000.

7. S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, Modeling Spatial Dependencies for Mining Geospatial Data, First SIAM
International Conference on Data Mining, 2001.

8. S. Shekhar, P.R. Schrater, R. R. Vatsavai, W. Wu, and S. Chawla, Spatial Contextual Classification and Prediction Models for
Mining Geospatial Data, IEEE Transactions on Multimedia, 2001. (Submitted)

Some papers are available on the Web sites: http://www.cs.umn.edu/research/shashi-group/