Spatial Datamining
Spatial Autocorrelation
Spatial Computing
Data, Data everywhere yet ....
• Can’t find the data I need
• data is scattered over the network
• many versions, subtle differences
• Can’t get the data I need
• need an expert to get the data
• Can’t understand the data I found
• available data poorly documented
• Can’t use the data I found
• results are unexpected
• data needs to be transformed from one form to another
What is Data Warehousing?
Data Warehouse?
• Different definitions -
• A decision support database that is maintained separately from the organization’s
operational database
• Support information processing by providing a solid platform of consolidated, historical data
for analysis.
• Data warehousing:
• The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
• Organized around major subjects.
[For example - customer, product, sales]
• Focusing on the modeling and analysis of data for decision makers, not on daily
operations or transaction processing.
• Provide a simple and concise view around particular subject issues by excluding
data that are not useful in the decision support process.
Data Warehouse—Time Variant
• Time horizon for the data warehouse is significantly longer than that of
operational systems.
• Operational database: current value data.
• Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
[Figure: lattice of cuboids. The 3-D cuboids are (time, item, location), (time, item, supplier), (time, location, supplier), and (item, location, supplier).]
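The cuboid lattice can be enumerated mechanically. The sketch below assumes the four dimensions named in the cuboid labels (time, item, location, supplier) and lists every subset of them; the function name `cuboids` is illustrative only.

```python
from itertools import combinations

# Dimension names taken from the cuboid labels in the figure.
DIMENSIONS = ("time", "item", "location", "supplier")


def cuboids(dimensions):
    """Return every cuboid (subset of dimensions), keyed by its dimensionality."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        lattice[k] = list(combinations(dimensions, k))
    return lattice


lattice = cuboids(DIMENSIONS)
for k, group in lattice.items():
    print(f"{k}-D cuboids ({len(group)}):", group)

# n dimensions give 2**n cuboids in total.
total = sum(len(group) for group in lattice.values())
print("total cuboids:", total)
```

With four dimensions this yields 16 cuboids, four of which are 3-D.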
[Figure: schema diagram. Recoverable attributes: branch (branch_key, branch_name, branch_type); location (location_key, street, city_key); city (city_key, city, province_or_state, country); fact-table keys (branch_key, location_key); measures (units_sold, dollars_sold, avg_sales).]
Example of Fact Constellation
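[Figure: fact constellation schema with a Sales Fact Table and a Shipping Fact Table. Recoverable attributes: time (time_key, day, day_of_the_week, month, quarter, year); item (item_key, item_name, brand, type, supplier_type); Sales Fact Table (time_key, item_key, branch_key); Shipping Fact Table (time_key, item_key, shipper_key, from_location).]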
[Figure: concept hierarchies. Recoverable levels: all, Office, Day, Month.]
A Sample Data Cube
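[Figure: data cube of sales with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), product (TV, PC, VCR, sum), and Country (India, China, Mexico, sum). The highlighted cell is the total annual sales of TV in India.]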
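A small sketch of how the "sum" cells of such a cube can be filled in. The fact rows and unit counts are hypothetical, invented only for illustration; the label "sum" mirrors the figure.

```python
from collections import defaultdict

# Hypothetical fact rows: (quarter, product, country, units_sold).
facts = [
    ("1Qtr", "TV", "India", 10),
    ("2Qtr", "TV", "India", 20),
    ("3Qtr", "TV", "India", 30),
    ("4Qtr", "TV", "India", 40),
    ("1Qtr", "PC", "India", 5),
    ("1Qtr", "TV", "China", 7),
]

ALL = "sum"  # label the figure uses for an aggregated dimension


def build_cube(rows):
    """Add each row to all 2**3 cells obtained by keeping or aggregating each dimension."""
    cube = defaultdict(int)
    for quarter, product, country, units in rows:
        for q in (quarter, ALL):
            for p in (product, ALL):
                for c in (country, ALL):
                    cube[(q, p, c)] += units
    return cube


cube = build_cube(facts)
print("annual TV sales in India:", cube[(ALL, "TV", "India")])
print("grand total:", cube[(ALL, ALL, ALL)])
```

Precomputing every cell this way trades storage for query speed; a real warehouse would typically materialize only some cuboids.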
Cuboids Corresponding to the Cube
[Figure: data warehouse architecture. Operational DBs and other sources feed Extract, Transform, Load, and Refresh steps, with a Monitor & Integrator and Metadata, into the Data Warehouse and Data Marts; an OLAP Server serves Analysis, Query, Reports, and Data mining.]
[Figure: Enterprise Data Warehouse with Data Marts, and Distributed Data Marts.]
• Knowledge discovery in databases (KDD) --- more general than data mining
• KDD process consists of six phases
1. Data selection
2. Data cleaning
3. Enrichment
4. Data transformation
5. Data mining
6. Display and reporting
• Example
Consumer goods retailer
• Association rule: whenever a customer buys product X he also buys product Y
• Sequential pattern: whenever a customer buys a camera then within six months he buys
photographic supplies
• Classification trees: credit-card customers, cash customers, etc.
Goals of Data Mining
• Prediction --- data mining can show how certain attributes within the data
will behave in the future
• Identification --- data patterns can be used to identify the existence of an
item, event, or an activity
• Classification --- data mining can partition the data so that different classes
or categories can be identified based on combinations of parameters
• Optimization --- one eventual goal of data mining may be to optimize the use
of limited resources such as time, space, money, or materials
Knowledge Discovery during Data Mining
• Raw data → Information → Knowledge
• Deductive knowledge
• Deduce new information based on applying pre-specified logical rules of deduction on the
given data
• Inductive knowledge
• Discover new rules and patterns from the available data
• Data mining addresses inductive knowledge
• Discovered knowledge can be
• Unstructured like rules or propositional logic
• Structured like decision trees, semantic network, neural networks, etc
Types of Knowledge Discovered
• Examples (Historic)
• 1855 Asiatic Cholera in London : A water pump identified as the source
• Fluoride and healthy gums near Colorado river
• Theory of Gondwanaland - continents fit like pieces of a jigsaw puzzle
• Examples (Recent)
• Cancer clusters to investigate environment health hazards
• Crime hotspots for planning police patrol routes
• Bald eagles nest on tall trees near open water
• West Nile virus spreading from north east USA to south and west
• Unusual warming of Pacific ocean (El Nino) affects weather in USA
[Ref: Spatial Databases: A Tour, by Shashi Shekhar, Sanjay Chawla; Internet resources]
Spatial Pattern?
• What is a Pattern?
• A frequent arrangement, configuration, composition, regularity
• A rule, law, method, design, description
• A major direction, trend, prediction
• A significant surface irregularity or unevenness
Why Spatial Data Mining? (contd.)
• New understanding of geographic processes for critical questions
• Ex. How is the health of planet Earth?
• Ex. Characterize effects of human activity on environment and ecology
• Ex. Predict effect of El Nino on weather and economy
• Traditional approach: manually generate and test hypotheses
• But spatial data is growing too fast to analyze manually
• Satellite imagery, GPS tracks, sensors on highways, …
• Number of possible geographic hypotheses is too large to explore manually
• Large number of geographic features and locations
• Number of interacting subsets of features grows exponentially
• SDM may reduce the set of plausible hypotheses
• Identify hypotheses supported by the data
• For further exploration using traditional statistical methods
Spatial Data Mining: Actors
• Domain Expert
• Identifies SDM goals and the spatial dataset
• Describes domain knowledge, e.g. well-known patterns and correlates
• Validates new patterns
• Data Mining Analyst
• Helps identify pattern families and the SDM techniques to be used
• Explains the SDM outputs to the Domain Expert
• Joint effort
• Feature selection
• Selection of patterns for further exploration
Data Mining Process
[Figure: data mining process. Labels: Problem Statement; Domain Expert; DM Analyst; Database; DM Algorithms (Association Rule, Classification, Clustering); Spatial SQL; Output (Hypothesis); Visualization; Interpretation; Verification; Refinement; Action; Feedback.]
Choice of Methods
• Approaches to mining Spatial Data
• Pick spatial features; use classical DM methods
• Use novel spatial data mining techniques
• Possible Approach:
• Define the problem: capture special needs
• Explore data using maps, other visualization
• Try reusing classical DM methods
• If classical DM methods perform poorly, try new methods
• Evaluate chosen methods rigorously
• Performance tuning as needed
Families of SDM Patterns
• Note:
• Other families of spatial patterns may be defined
• SDM is a growing field, which should accommodate new pattern families
Location Prediction
• Question addressed
• Where will a phenomenon occur?
• Which spatial events are predictable?
• How can a spatial event be predicted from other spatial events?
• Equations, rules, other methods
• Examples:
• Where will an endangered bird nest?
• Which areas are prone to fire given maps of vegetation, drought, etc.?
• What should be recommended to a traveler in a given location?
Spatial Interactions - Examples
• Which spatial events are related to each other?
• Which spatial phenomena depend on other phenomena?
Hot spots
• Question addressed
• Is a phenomenon spatially clustered?
• Which spatial entities or clusters are unusual?
• Which spatial entities share common characteristics?
• Examples:
• Cancer clusters [CDC] to launch investigations
• Crime hot spots to plan police patrols
• Defining unusual
• Comparison group:
• neighborhood
• entire population
• Significance: probability of being unusual is high
Unique Properties of Spatial Patterns
• Items in traditional data are independent of each other,
• whereas properties of locations in a map are often “auto-correlated”.
• Traditional data deals with simple domains, e.g. numbers and symbols,
• whereas spatial data types are complex
• Items in traditional data describe discrete objects
• whereas spatial data is continuous
• First law of geography [Tobler]:
• Everything is related to everything else, but nearby things are more related than distant things.
• People with similar backgrounds tend to live in the same area
• Economies of nearby regions tend to be similar
• Changes in temperature occur gradually over space (and time)
Mapping Techniques to Spatial Pattern Families
• Overview
• Several techniques to find a spatial pattern family
• Choice of technique depends on feature selection, spatial data, etc.
• Spatial pattern families vs. Techniques
• Location Prediction: Classification, function determination
• Interaction: Correlation, Association, Co-locations
• Hot spots: Clustering, Outlier Detection
• Focus on
• Spatial problems
• Even though these techniques apply to non-spatial datasets too
Spatial Autocorrelation - Basics
• What is Spatial Autocorrelation?
• Why is Spatial Autocorrelation Important?
• How to Measure Spatial Autocorrelation?
• Examples
Spatial Autocorrelation
• Spatial Autocorrelation is a special property of geospatial data.
• It is the formal property that measures the degree to which near and distant things are related
• It is a statistical test of match between locational similarity and attribute similarity
• It is a property that is often exhibited by variables which are sampled over space
• It is based on Tobler’s 1st law of geography.
• Examples:
• Temperature values of two locations near to each other will be similar.
Types of Spatial Autocorrelation
• Positive Autocorrelation
• Negative Autocorrelation
• Random Autocorrelation
Goals:
• To measure the strength of spatial autocorrelation in a map
• To test the assumption of independence or randomness
• To explore whether there is any clustering pattern in the data or whether the data are just random
Measuring Spatial Autocorrelation
Rook Case
For irregular polygons: all polygons that share a common border, or have a centroid within the circle defined by the average distance to centroids of polygons that share a common border.
Weights matrix w_ij for units A to F (1 = neighbors, 0 = not neighbors):
    A  B  C  D  E  F
A   0  1  1  1  0  0
B   1  0  1  0  1  0
C   1  1  0  1  1  0
D   1  0  1  0  1  1
E   0  1  1  1  0  1
F   0  0  0  1  1  0
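A binary contiguity matrix like this one can be built from neighbor lists. In the sketch below the neighbor lists are read off the matrix for units A to F; the helper name `contiguity_matrix` is an illustrative choice, not part of any library.

```python
# Neighbor lists read off the weights matrix for units A to F.
neighbors = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "C", "E", "F"],
    "E": ["B", "C", "D", "F"],
    "F": ["D", "E"],
}
units = sorted(neighbors)


def contiguity_matrix(neighbors, units):
    """Return w with w[i][j] = 1 when unit j is a neighbor of unit i, else 0."""
    index = {u: k for k, u in enumerate(units)}
    w = [[0] * len(units) for _ in units]
    for u, adjacent in neighbors.items():
        for v in adjacent:
            w[index[u]][index[v]] = 1
    return w


w = contiguity_matrix(neighbors, units)
for u, row in zip(units, w):
    print(u, row)

# Contiguity is mutual, so the matrix should be symmetric.
n = len(units)
is_symmetric = all(w[i][j] == w[j][i] for i in range(n) for j in range(n))
# Each join appears twice in a symmetric matrix.
n_joins = sum(sum(row) for row in w) // 2
print("symmetric:", is_symmetric, "joins:", n_joins)
```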
• Location‐specific statistics
• Used to determine if local autocorrelation exists around each region i
• Clusters/hot‐spots
• Heterogeneity
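$$I_i = \frac{n\,(x_i - \bar{x}) \sum_{j=1}^{n} w_{ij}\,(x_j - \bar{x})}{\left(\sum_{j=1}^{n} w_{ij}\right) \sum_{j=1}^{n} (x_j - \bar{x})^2}$$
• $x_i$: the i-th point for which we calculate $I_i$
• Neighborhood: specified by the weights matrix $w_{ij}$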
Join Count Statistics Method
• A join, or edge, is classified as either WW (0-0), BB (1-1), or BW (1-0).
• Positive Autocorrelation: large proportion (or count) of BB and WW joins and small proportion of BW joins
• Negative Autocorrelation: large proportion (or count) of BW joins and small proportion of BB and WW joins
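Counting the three join types is straightforward once the joins are listed. The sketch below uses a hypothetical 2 x 2 grid with rook joins; the cell numbering and the two colorings are invented for illustration.

```python
def join_counts(joins, value):
    """Count BB (1-1), WW (0-0), and BW (1-0) joins for a binary map."""
    counts = {"BB": 0, "WW": 0, "BW": 0}
    for a, b in joins:
        if value[a] == 1 and value[b] == 1:
            counts["BB"] += 1
        elif value[a] == 0 and value[b] == 0:
            counts["WW"] += 1
        else:
            counts["BW"] += 1
    return counts


# Hypothetical 2 x 2 grid, cells numbered 0 1 / 2 3, rook (shared-edge) joins.
grid_joins = [(0, 1), (0, 2), (1, 3), (2, 3)]

checkerboard = {0: 1, 1: 0, 2: 0, 3: 1}  # alternating values
clustered = {0: 1, 1: 1, 2: 0, 3: 0}     # like values side by side

checker_counts = join_counts(grid_joins, checkerboard)
cluster_counts = join_counts(grid_joins, clustered)
print("checkerboard:", checker_counts)
print("clustered:", cluster_counts)
```

The checkerboard produces only BW joins, the pattern associated with negative autocorrelation, while the clustered map produces BB and WW joins as well.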
Join Count Statistic: Calculation
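[Figure: map of five areal units A, B, C, D, E]
(Global)
$$I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\left(\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\right) \sum_{i=1}^{n} (x_i - \bar{x})^2} = -0.2806$$
The I value is less than 0. Therefore, the areal pattern may be dispersed. A Z-test is required.
(Local) For i = C:
$$I_i = \frac{n\,(x_i - \bar{x}) \sum_{j=1}^{n} w_{ij}\,(x_j - \bar{x})}{\left(\sum_{j=1}^{n} w_{ij}\right) \sum_{j=1}^{n} (x_j - \bar{x})^2} = 0$$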
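A minimal sketch of global and local Moran's I. It assumes the local statistic is normalized by the row sum of the weights, and it uses a hypothetical chain of four units because the attribute values behind the slide's example are not recoverable.

```python
def global_morans_i(x, w):
    """I = n * sum_ij w_ij (x_i - mean)(x_j - mean) / (sum_ij w_ij * sum_i (x_i - mean)^2)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    total_w = sum(sum(row) for row in w)
    ss = sum(d * d for d in dev)
    return n * cross / (total_w * ss)


def local_morans_i(x, w, i):
    """I_i = n (x_i - mean) sum_j w_ij (x_j - mean) / (sum_j w_ij * sum_j (x_j - mean)^2)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    lag = sum(w[i][j] * dev[j] for j in range(n))
    row_w = sum(w[i])  # assumed normalization by the number of neighbors of i
    ss = sum(d * d for d in dev)
    return n * dev[i] * lag / (row_w * ss)


# Hypothetical chain of four units: 0 - 1 - 2 - 3.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
trend = [1, 2, 3, 4]        # similar values sit next to each other
alternating = [1, 0, 1, 0]  # unlike values sit next to each other

i_trend = global_morans_i(trend, chain)
i_alt = global_morans_i(alternating, chain)
i_local = local_morans_i(trend, chain, 1)
print(i_trend, i_alt, i_local)
```

The smooth trend gives a positive I and the alternating series a negative I, matching the sign interpretation used in the example.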
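Example: Join Count Statistic
[Figure: map of five areal units A, B, C, D, E]
Given $p_+ = 3/5$ and $p_- = 2/5$:
$$E_{+-} = 2 \times 7 \times \frac{3}{5} \times \frac{2}{5} = 3.36$$
$m = 28$
$$V_{+-} = 2 \times (7 + 28) \times \frac{3}{5} \times \frac{2}{5} - 4 \times (7 + (2 \times 28)) \times \left(\frac{3}{5}\right)^2 \times \left(\frac{2}{5}\right)^2 = 2.28$$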
Example: Join Count Statistic (contd.)
O : Observed Value
E : Expected Value
V : Variance
$$Z = \frac{O - E}{\sqrt{V}} = \frac{5 - 3.36}{\sqrt{2.28}} = 1.086$$
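The arithmetic of this example can be checked with a few lines. The sketch below plugs in the numbers from the slides (7 joins, m = 28, p+ = 3/5, observed count 5); the function and parameter names are illustrative, and the variance expression is simply the one used in the example.

```python
import math


def bw_join_test(observed, joins, m, p_plus):
    """Expected value, variance, and z-score for the count of unlike (+-) joins."""
    p_minus = 1 - p_plus
    pq = p_plus * p_minus
    expected = 2 * joins * pq
    variance = 2 * (joins + m) * pq - 4 * (joins + 2 * m) * pq ** 2
    z = (observed - expected) / math.sqrt(variance)
    return expected, variance, z


expected, variance, z = bw_join_test(observed=5, joins=7, m=28, p_plus=3 / 5)
print(f"E = {expected:.2f}, V = {variance:.2f}, Z = {z:.3f}")
```

With unrounded variance the z-score comes out near 1.085; the 1.086 in the example follows from rounding V to 2.28 first. Either way |Z| is below 1.96, so the pattern would not be judged significant at the 5 percent level in a two-sided test.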
• Spatial methods
• Spatial auto-regression (SAR)
• Markov random field
• Bayesian classifier
Association Rule Mining
• Classical method:
• Association rule given item-types and transactions
• Assumes spatial data can be decomposed into transactions
• However, such decomposition may alter spatial patterns
• Spatial methods
• Spatial association rules
• Spatial co-locations
• Note: Association rules or co-location rules are fast filters to reduce the number of pairs for rigorous statistical analysis, e.g. correlation analysis, cross-K-function for spatial interaction, etc.
Associations, Spatial Associations, Co-location
Association Rules Discovery
• Examples (Spatial)
• (bedrock type = limestone), (soil depth < 50 feet) => (sink hole risk = high)
• Support = 20 percent, Confidence = 0.8
• Interpretation: Locations with limestone bedrock and low soil depth have high risk of
sink hole formation.
Association Rules: Formal Definitions
• Support of C: $\sigma(C) = \{\, t \mid t \in T,\ C \subseteq t \,\}$
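A short sketch of how support and confidence can be computed from a set of transactions. It assumes the common convention that a rule's support is the fraction of transactions containing both sides and its confidence is that count divided by the count containing the antecedent. The 20 transactions are hypothetical, chosen to reproduce the 20 percent support and 0.8 confidence of the sink hole example.

```python
def support_set(itemset, transactions):
    """sigma(C): the transactions t in T that contain every item of C."""
    return [t for t in transactions if itemset <= t]


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent => consequent."""
    both = support_set(antecedent | consequent, transactions)
    ante = support_set(antecedent, transactions)
    support = len(both) / len(transactions)
    confidence = len(both) / len(ante) if ante else 0.0
    return support, confidence


# Hypothetical locations encoded as transactions of spatial predicates.
risky = frozenset({"bedrock=limestone", "soil_depth<50ft", "sinkhole_risk=high"})
safe = frozenset({"bedrock=limestone", "soil_depth<50ft"})
other = frozenset({"bedrock=granite"})
transactions = [risky] * 4 + [safe] * 1 + [other] * 15

antecedent = frozenset({"bedrock=limestone", "soil_depth<50ft"})
consequent = frozenset({"sinkhole_risk=high"})
support, confidence = rule_metrics(antecedent, consequent, transactions)
print(f"support = {support:.0%}, confidence = {confidence:.1f}")
```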
Spatial Association Rules
• Spatial Association Rules
• A special reference spatial feature
• Transactions are defined around instances of the special spatial feature
• Item-types = spatial predicates
Co-location Rules
• Motivation
• Association rules need transactions (subsets of instances of item-types)
• Spatial data is continuous
• Decomposing spatial data into transactions may alter patterns
• Co-location Rules
• For point data in space
• Do not need transactions; work directly with continuous space
• Use neighborhood definition and spatial joins
• “Natural approach”
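The neighborhood-plus-spatial-join idea can be sketched without any transactions. The example below counts pairs of instances of different feature types that lie within a chosen radius; the point data, the feature names, and the radius are hypothetical, and a brute-force loop stands in for a real spatial join.

```python
import math
from collections import Counter

# Hypothetical point instances: (feature_type, x, y).
points = [
    ("nest", 0.0, 0.0),
    ("water", 0.5, 0.0),
    ("nest", 5.0, 5.0),
    ("water", 5.0, 5.8),
    ("nest", 10.0, 0.0),
    ("road", 9.0, 9.0),
]


def neighbor_type_pairs(points, radius):
    """Count instance pairs of different feature types within `radius` of each other."""
    counts = Counter()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            type_i, xi, yi = points[i]
            type_j, xj, yj = points[j]
            if type_i == type_j:
                continue  # only pairs of different feature types are of interest here
            if math.hypot(xi - xj, yi - yj) <= radius:
                counts[tuple(sorted((type_i, type_j)))] += 1
    return counts


pairs = neighbor_type_pairs(points, radius=1.0)
print(pairs)
```

Feature-type pairs with many neighboring instances are candidates for a co-location rule; they would still need the more rigorous statistical follow-up that the note on association and co-location rules calls for.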
Colocation Rules
Co-location rules vs. Association rules
• Population density
• Grouping Goal - central places
• identify locations that dominate surroundings
• Grouping goal - homogeneous areas
Techniques for Clustering
• Categorizing classical methods:
• Hierarchical methods
• Partitioning methods, e.g. K-means, K-medoids
• Density based methods
• Grid based methods
• Spatial methods
• Comparison with complete spatial random processes
• Neighborhood EM
• Focus:
• Partitioning methods and new spatial methods
• Outlier detection has methods similar to density based methods
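As an illustration of a classical partitioning method, here is a bare-bones K-means sketch on 2-D points. Seeding the centers with the first k points and the sample coordinates are simplifying assumptions made for this example; production code would use a better initialization.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd-style K-means on 2-D points; the first k points seed the centers."""
    centers = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers


# Hypothetical locations forming two well-separated groups.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```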
Outliers
• What is an outlier?
• Observations inconsistent with rest of the dataset
• Techniques for global outliers
• Statistical tests based on membership in a distribution
• Pr.[item in population] is low
• Non-statistical tests based on distance, nearest neighbors, convex hull, etc.
• Spatial outliers?
• Observations inconsistent with their neighborhoods
• A local instability or discontinuity
• Techniques for spatial outliers
• Graphical - Variogram cloud, Moran scatterplot
• Algebraic - Scatterplot, Z(S(x))
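A sketch of the algebraic idea, assuming S(x) denotes the difference between the value at x and the average value of its neighbors, and Z(S(x)) its standardized magnitude. The chain of sites, the values, and the threshold of 2 are hypothetical choices for illustration.

```python
import math


def spatial_outliers(values, neighbors, threshold=2.0):
    """Flag sites whose difference from their neighborhood mean is unusually large."""
    # S(x): value at x minus the mean value of the neighbors of x.
    s = {}
    for site, value in values.items():
        nbrs = neighbors[site]
        s[site] = value - sum(values[n] for n in nbrs) / len(nbrs)
    mean_s = sum(s.values()) / len(s)
    std_s = math.sqrt(sum((v - mean_s) ** 2 for v in s.values()) / len(s))
    if std_s == 0:
        return [], {site: 0.0 for site in values}  # no variation, no outliers
    # Z(S(x)): standardized magnitude of S(x).
    z = {site: abs((v - mean_s) / std_s) for site, v in s.items()}
    flagged = [site for site in values if z[site] > threshold]
    return flagged, z


# Hypothetical chain of nine sites; site 4 differs sharply from its neighbors.
values = {i: 10.0 for i in range(9)}
values[4] = 50.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 9] for i in range(9)}

flagged, z = spatial_outliers(values, neighbors)
print("spatial outliers:", flagged)
```

Only site 4 is flagged: its own S(x) is large, while its two neighbors are pulled away from their neighborhood means by a smaller amount that stays under the threshold.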
Summary
• Patterns are the opposite of random
• Common spatial patterns: location prediction, feature interaction, hot spots
• SDM = search for unexpected interesting patterns in large spatial databases
• Spatial patterns may be discovered using
• Techniques like classification, associations, clustering and outlier detection
• New techniques are needed for SDM due to
• Spatial Auto-correlation
• Continuity of space