
Spatial Informatics

Prof. Soumya K. Ghosh


Computer Science and Engineering

Spatial Analysis (1)


Concepts Covered:

• Data Warehousing & Data Mining - Basics

• Spatial Data Mining

• Spatial Autocorrelation

• Spatial Computing
Data, Data everywhere yet ....
• Can’t find the data I need
• data is scattered over the network
• many versions, subtle differences
• Can’t get the data I need
• need an expert to get the data
• Can’t understand the data I found
• available data poorly documented
• Can’t use the data I found
• results are unexpected
• data needs to be transformed from one form to another
What is Data Warehousing?

A process of transforming data into information and making it available to
users in a timely enough manner to make a difference

[Forrester Research, April 1996]

[Figure: Data → Information]
Data Warehouse?
• Different definitions:
• A decision support database that is maintained separately from the organization’s
operational database
• Supports information processing by providing a solid platform of consolidated, historical data
for analysis.

• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process.” (W. H. Inmon)

• Data warehousing:
• The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
• Organized around major subjects.
[For example - customer, product, sales]
• Focuses on the modeling and analysis of data for decision makers, not on daily
operations or transaction processing.
• Provides a simple and concise view around particular subject issues by excluding
data that are not useful in the decision support process.
Data Warehouse—Integrated

• Constructed by integrating multiple, heterogeneous data sources


• relational databases, flat files, on-line transaction records

• Data cleaning and data integration techniques are applied.


• Ensure consistency in naming conventions, encoding structures, attribute measures, etc.
among different data sources
• “Interoperability”
• When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant

• Time horizon for the data warehouse is significantly longer than that of
operational systems.
• Operational database: current value data.
• Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

• Every key structure in the data warehouse contains an element of time, explicitly or implicitly
• But the key of operational data may or may not contain a “time element”.
Data Warehouse—Non-Volatile
• A physically separate store of data transformed from the operational
environment.
• Operational update of data does not occur in the data warehouse
environment.
• Does not require transaction processing, recovery, and concurrency control mechanisms
• Requires only two operations in data accessing: initial loading of data and access of data.
Data Warehouse vs. Heterogeneous DBMS
• Traditional heterogeneous DB integration:
• Build wrappers/mediators on top of heterogeneous databases
• Query driven approach
• When a query is posed to a client site, a meta-dictionary is used to translate
the query into queries appropriate for individual heterogeneous sites
involved, and the results are integrated into a global answer set
• Requires complex information filtering and competes for resources at the local sites

• Data warehouse: update-driven, high performance


• Information from heterogeneous sources is integrated in advance and stored in
warehouses for direct query and analysis
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
• Major task of traditional relational DBMS
• Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting,
etc.
• OLAP (on-line analytical processing)
• Major task of data warehouse system
• Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
• User and system orientation: customer vs. market
• Data contents: current, detailed vs. historical, consolidated
• Database design: ER + application vs. star + subject
• View: current, local vs. evolutionary, integrated
• Access patterns: update vs. read-only but complex queries
Why Data Warehouse?
• High performance for both systems
• DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery
• Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
• Different functions and different data:
• missing data: Decision support requires historical data which operational DBs do not typically
maintain
• data consolidation: DS requires consolidation (aggregation, summarization) of data from
heterogeneous sources
• data quality: different sources typically use inconsistent data representations, codes and
formats which have to be reconciled
Multi-dimensional Data Model – From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which views data
in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple
dimensions
• Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter,
year)
• Fact table contains measures (such as dollars_sold) and keys to each of the related dimension
tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The top
most 0-D cuboid, which holds the highest-level of summarization, is called the
apex cuboid. The lattice of cuboids forms a data cube.
Cube: A Lattice of Cuboids
[Figure: lattice of cuboids for the dimensions time, item, location, supplier]
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
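The lattice can be enumerated mechanically: every subset of the dimension set is one cuboid, so n dimensions give 2^n cuboids when no concept hierarchies are considered. The following is a minimal sketch of that enumeration; the function name `cuboid_lattice` is an illustrative assumption, not part of the lecture material.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Map each level k to the list of k-D cuboids (tuples of dimension names)."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        lattice[k] = list(combinations(dimensions, k))
    return lattice

dims = ["time", "item", "location", "supplier"]
lattice = cuboid_lattice(dims)

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)

# Total number of cuboids in the lattice (2 ** number of dimensions)
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

Level 0 holds the single apex cuboid (the empty tuple, i.e. "all") and level 4 holds the base cuboid.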


Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures


• Star schema: A fact table in the middle connected to a set of dimension tables
• Snowflake schema: A refinement of star schema where some dimensional hierarchy is
normalized into a set of smaller dimension tables, forming a shape similar to snowflake
• Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of
stars, therefore called galaxy schema or fact constellation
Example of Star Schema
[Figure: star schema for sales]
Sales Fact Table (keys): time_key, item_key, branch_key, location_key
Sales Fact Table (measures): units_sold, dollars_sold, avg_sales
Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table item: item_key, item_name, brand, type, supplier_type
Dimension table branch: branch_key, branch_name, branch_type
Dimension table location: location_key, street, city, province_or_state, country
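To make the star layout concrete, the sketch below builds a reduced version of this schema in an in-memory SQLite database and runs one aggregate query that joins the fact table to two dimension tables. The `_dim` and `_fact` table-name suffixes, the omission of the branch dimension and of several columns, and all sample rows are assumptions made for brevity; they are not from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Reduced star schema: one fact table referencing three dimension tables.
cur.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
""")

# Hypothetical sample rows.
cur.executemany("INSERT INTO time_dim VALUES (?,?,?,?,?)",
                [(1, 5, 1, "Q1", 2023), (2, 9, 4, "Q2", 2023)])
cur.executemany("INSERT INTO item_dim VALUES (?,?,?,?)",
                [(1, "TV-A", "BrandX", "TV"), (2, "PC-B", "BrandY", "PC")])
cur.executemany("INSERT INTO location_dim VALUES (?,?,?)",
                [(1, "Kolkata", "India"), (2, "Toronto", "Canada")])
cur.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?)", [
    (1, 1, 1, 10, 5000.0),
    (2, 1, 1, 4, 2000.0),
    (1, 2, 2, 3, 2400.0),
    (2, 2, 1, 2, 1600.0),
])

# Aggregate the dollars_sold measure by country and quarter.
rows = cur.execute("""
    SELECT l.country, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t     ON f.time_key = t.time_key
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.country, t.quarter
    ORDER BY l.country, t.quarter
""").fetchall()

for row in rows:
    print(row)
```

Every analytical query follows the same shape: start from the fact table, join only the dimensions needed, and group by the dimension attributes of interest.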
Example of Snowflake Schema
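[Figure: snowflake schema for sales]
Sales Fact Table (keys): time_key, item_key, branch_key, location_key
Sales Fact Table (measures): units_sold, dollars_sold, avg_sales
Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table item: item_key, item_name, brand, type, supplier_key
Dimension table supplier: supplier_key, supplier_type
Dimension table branch: branch_key, branch_name, branch_type
Dimension table location: location_key, street, city_key
Dimension table city: city_key, city, province_or_state, country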
Example of Fact Constellation
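[Figure: fact constellation with two fact tables sharing dimension tables]
Sales Fact Table (keys): time_key, item_key, branch_key, location_key
Sales Fact Table (measures): units_sold, dollars_sold, avg_sales
Shipping Fact Table (keys): time_key, item_key, shipper_key, from_location, to_location
Shipping Fact Table (measures): dollars_cost, units_shipped
Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table item: item_key, item_name, brand, type, supplier_type
Dimension table branch: branch_key, branch_name, branch_type
Dimension table location: location_key, street, city, province_or_state, country
Dimension table shipper: shipper_key, shipper_name, location_key, shipper_type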
A Concept Hierarchy: Dimension (location)

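[Figure: concept hierarchy for the dimension location, from most general to most specific]
all: all
region: Europe, ..., North_America
country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
office: L. Chan, ..., M. Wind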



Spatial Analysis (2)


Multidimensional Data
• Sales volume as a function of Product, Month, and Region

Dimensions: Product, Location, Time

Hierarchical summarization paths:
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month > Day, with Week > Day as an alternative path
A Sample Data Cube
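[Figure: sample data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (India, China, Mexico, sum); the highlighted cell is the total annual sales of TV in India]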
Cuboids Corresponding to the Cube

[Figure: lattice of cuboids for the dimensions product, date, country]
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: (product, date); (product, country); (date, country)
3-D (base) cuboid: (product, date, country)


Typical OLAP Operations
• Roll up (drill-up): summarize data
• by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
• from higher level summary to lower level summary or detailed data, or introducing new
dimensions
• Slice and dice:
• project and select
• Pivot (rotate):
• reorient the cube, visualization, 3D to series of 2D planes.
• Other operations
• drill across: involving (across) more than one fact table
• drill through: through the bottom level of the cube to its back-end relational tables
(using SQL)
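The first three operations can be illustrated on a tiny cube held in a dictionary. This is only a sketch: the sales figures are hypothetical, the cube is stored as a mapping from (product, quarter, country) to a single measure, and the helper names `roll_up`, `slice_`, and `dice` are assumptions chosen for illustration. Roll-up is shown in its dimension-reduction form rather than as climbing a concept hierarchy.

```python
from collections import defaultdict

# Hypothetical base cuboid: (product, quarter, country) -> sales
DIMS = ("product", "quarter", "country")
cube = {
    ("TV", "Q1", "India"): 100, ("TV", "Q2", "India"): 120,
    ("PC", "Q1", "India"): 80,  ("PC", "Q2", "India"): 90,
    ("TV", "Q1", "China"): 150, ("PC", "Q2", "China"): 70,
}

def roll_up(cube, keep):
    """Roll up by dimension reduction: sum away every dimension not in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def slice_(cube, dim, value):
    """Slice: select one value on one dimension, then project that dimension out."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}

def dice(cube, **allowed):
    """Dice: select on several dimensions at once, keeping all dimensions."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}

by_product = roll_up(cube, ["product"])            # 1-D cuboid
apex = roll_up(cube, [])                           # 0-D (apex) cuboid
india = slice_(cube, "country", "India")           # 2-D sub-cube for India
tv_q1 = dice(cube, product={"TV"}, quarter={"Q1"})

print(by_product, apex, india, tv_q1, sep="\n")
```

Drill-down is the reverse direction: it requires the more detailed cuboid (here `cube` itself) to be available, since detail cannot be recovered from an aggregate.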
Multi-Tiered Architecture

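[Figure: multi-tiered data warehouse architecture]
Data Sources: operational DBs and other sources, fed through Extract, Transform, Load, Refresh
Data Storage: data warehouse and data marts, with metadata and a monitor & integrator
OLAP Engine: OLAP server, served by the data warehouse
Front-End Tools: analysis, query, reports, data mining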


Data Warehouse Development

[Figure: data warehouse development. Define a high-level corporate data model; through model refinement, build the data marts and the enterprise data warehouse; then the distributed data marts; and finally the multi-tier data warehouse]


DATA MINING - Basics
Data Mining
• Data mining refers to the discovery of new information in terms of patterns
or rules from vast amounts of data
• Data warehousing and Data mining
• The goal of data warehouse is to support decision making process
• Data mining can be used in conjunction with a data warehouse to help with certain
decisions
• Data mining can be applied to operational databases but to make it more efficient and
meaningful it is applied to data warehouses
• Data mining applications should be considered early during the design of a
data warehouse
“DW-DM” Architecture
Data Mining and Knowledge Discovery

• Knowledge discovery in databases (KDD) --- more general than data mining
• KDD process consists of six phases
1. Data selection
2. Data cleaning
3. Enrichment
4. Data transformation
5. Data mining
6. Display and reporting
• Example
Consumer goods retailer
• Association rule: whenever a customer buys product X he also buys product Y
• Sequential pattern: whenever a customer buys a camera then within six months he buys
photographic supplies
• Classification trees: credit-card customers, cash customers, etc.
Goals of Data Mining

• Prediction --- data mining can show how certain attributes within the data
will behave in the future
• Identification --- data patterns can be used to identify the existence of an
item, event, or an activity
• Classification --- data mining can partition the data so that different classes
or categories can be identified based on combinations of parameters
• Optimization --- one eventual goal of data mining may be to optimize the use
of limited resources such as time, space, money, or materials
Knowledge Discovery during Data Mining
• Raw data → Information → Knowledge
• Deductive knowledge
• Deduce new information based on applying pre-specified logical rules of deduction on the
given data
• Inductive knowledge
• Discover new rules and patterns from the available data
• Data mining addresses inductive knowledge
• Discovered knowledge can be
• Unstructured like rules or propositional logic
• Structured like decision trees, semantic network, neural networks, etc
Types of Knowledge Discovered

Knowledge discovered during data mining can be described as


• Association rules --- correlate the presence of a set of items with another range of values for
another set of variables
• Classification hierarchies --- create hierarchies of classes
• Sequential patterns --- sequence of actions or events
• Pattern with time series --- similarities detected within positions of the time series
• Categorization and segmentation --- partition a given population of events or items into sets
of “similar” elements.
Association Rules
• An association rule is of the form X → Y
where X = {x1, x2, …, xn} and Y = {y1, y2, …, ym} are sets of distinct items
The rule states that if a customer buys X, he is also likely to buy Y
• The set LHS ∪ RHS is called an itemset
• Interest measures
1. Support (prevalence) for the rule LHS → RHS is the percentage of transactions that hold all
the items in the itemset LHS ∪ RHS.
2. Confidence (strength) for the rule LHS → RHS is the fraction of the transactions that include
the items in LHS which also include the items in RHS.
• Confidence is computed as support(LHS ∪ RHS) / support(LHS)
Example
Tid   Time   Items
101   6:35   milk, bread, cookies, juice
102   7:38   milk, juice
103   8:05   milk, eggs
104   8:40   bread, cookies, coffee

Consider two rules: milk → juice and bread → juice

Support of {milk, juice} is 50%
Support of {bread, juice} is 25%
Confidence of milk → juice is 66.7%
Confidence of bread → juice is 50%
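The four figures above can be checked directly from the transaction table. The sketch below is a straightforward count over the four transactions; the function names `support` and `confidence` are illustrative assumptions.

```python
# The four transactions from the example table (Tid -> set of items)
transactions = {
    101: {"milk", "bread", "cookies", "juice"},
    102: {"milk", "juice"},
    103: {"milk", "eggs"},
    104: {"bread", "cookies", "coffee"},
}

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """support(LHS ∪ RHS) / support(LHS)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print("support {milk, juice}    :", support({"milk", "juice"}))
print("support {bread, juice}   :", support({"bread", "juice"}))
print("confidence milk -> juice :", round(confidence({"milk"}, {"juice"}), 3))
print("confidence bread -> juice:", confidence({"bread"}, {"juice"}))
```

Milk appears in three transactions and two of them also contain juice, which gives the 2/3 confidence; bread appears in two transactions and one of them contains juice.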
Generic Architecture of Data

(synonym) Transaction data


Data Mining Objectives:

• Forecasting what may happen in the future


• Classifying people or things into groups by recognizing patterns
• Clustering people or things into groups based on their attributes
• Associating what events are likely to occur together
• Sequencing what events are likely to lead to later events

Spatial Analysis (3)


Spatial Data Mining
Spatial Patterns

• Examples (Historic)
• 1855 Asiatic Cholera in London : A water pump identified as the source
• Fluoride and healthy gums near Colorado river
• Theory of Gondwanaland - continents fit like pieces of a jigsaw puzzle

• Examples (Recent)
• Cancer clusters to investigate environment health hazards
• Crime hotspots for planning police patrol routes
• Bald eagles nest on tall trees near open water
• West Nile virus spreading from north east USA to south and west
• Unusual warming of Pacific ocean (El Nino) affects weather in USA

[Ref: Spatial Databases: A Tour, by Shashi Shekhar, Sanjay Chawla; Internet resources]
Spatial Pattern ?
• What is a Pattern?
• A frequent arrangement, configuration, composition, regularity
• A rule, law, method, design, description
• A major direction, trend, prediction
• A significant surface irregularity or unevenness

• What is not a pattern?


• Random, haphazard, chance, stray, accidental, unexpected
• Without definite direction, trend, rule, method, design, aim, purpose
• Accidental - without design, outside regular course of things
• Casual - absence of pre-arrangement, relatively unimportant
• Fortuitous - What occurs without known cause
[Ref: Spatial Databases: A Tour, by Shashi Shekhar, Sanjay Chawla; Internet resources]
Spatial Data Mining
• Metaphors
• Mining nuggets of information embedded in large databases
• Nuggets = interesting, useful, unexpected spatial patterns
• Mining = looking for nuggets

• Spatial Data Mining


• Search for spatial patterns
• Non-trivial search - as “automated” as possible—reduce human effort
• Interesting, useful and unexpected spatial pattern
Spatial Data Mining
• Non-trivial search for interesting and unexpected spatial pattern
• Non-trivial Search
• Large (e.g. exponential) search space of plausible hypothesis
• Ex. Asiatic cholera : causes: water, food, air, insects, …; water delivery mechanisms -
numerous pumps, rivers, ponds, wells, pipes, ...
• Interesting
• Useful in certain application domain
• Ex. Shutting off identified Water pump => saved human life
• Unexpected
• Pattern is not common knowledge
• May provide a new understanding of world
• Ex. Water pump - Cholera connection led to the “germ” theory
[Ref: Spatial Databases: A Tour, by Shashi Shekhar, Sanjay Chawla; Internet resources]
What is NOT Spatial Data Mining?
• Simple Querying of Spatial Data
• Find neighbors of West Bengal given names and boundaries of all states
• Find shortest path from Kharagpur to Hyderabad in a national road network
• Search space is not large (not exponential)
• Testing a hypothesis via a primary data analysis
• Ex. Female chimpanzee territories are smaller than male territories
• Search space is not large !
• SDM: secondary data analysis to generate multiple plausible hypotheses
• Uninteresting or obvious patterns in spatial data
• Heavy rainfall in City-A is correlated with heavy rainfall in City-B, given that the two cities are
close to each other.
• Common knowledge: Nearby places have similar rainfall
• Mining of non-spatial data
• Sales of Product-A and Product-B are correlated on weekends
Why Spatial Data Mining?
• Two basic reasons for SDM
• Consideration of use in certain application domains
• Provide fundamental new understanding
• Application domains
• Scale up secondary spatial (statistical) analysis to very large datasets
• Find the epidemic clusters to locate hazardous environments
• Prepare land-use maps from satellite imagery
• Predict habitat suitable for endangered species
• Find new spatial patterns
• Find groups of co-located geographic features

[Ref: Spatial Databases: A Tour, by Shashi Shekhar, Sanjay Chawla; Internet resources]
Why Spatial Data Mining? (contd.)
• New understanding of geographic processes for critical questions
• Ex. How is the health of planet Earth?
• Ex. Characterize effects of human activity on environment and ecology
• Ex. Predict effect of El Nino on weather, and economy
• Traditional approach: manually generate and test hypotheses
• But, spatial data is growing too fast to analyze manually
• Satellite imagery, GPS tracks, sensors on highways, …
• Number of possible geographic hypotheses too large to explore manually
• Large number of geographic features and locations
• Number of interacting subsets of features grows exponentially
• SDM may reduce the set of plausible hypotheses
• Identify hypotheses supported by the data
• For further exploration using traditional statistical methods
Spatial Data Mining: Actors
• Domain Expert
• Identifies SDM goals, spatial dataset
• Describes domain knowledge, e.g. well-known patterns and correlates
• Validates new patterns
• Data Mining Analyst
• Helps identify pattern families, SDM techniques to be used
• Explains the SDM outputs to the Domain Expert
• Joint effort
• Feature selection
• Selection of patterns for further exploration

[Ref: Spatial Databases: A Tour, by Shashi Shekhar, Sanjay Chawla; Internet resources]
Data Mining Process

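[Figure: data mining process. The domain expert and the DM analyst agree on a problem statement; DM algorithms (association rule, classification, clustering) run over the database; the output (hypothesis) goes through verification, refinement, and visualization, with spatial SQL access to the database; interpretation leads to action and feedback]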
[Ref: Spatial Databases: A Tour, by Shashi Shekhar, Sanjay Chawla; Internet resources]
Choice of Methods
• Approaches to mining Spatial Data
• Pick spatial features; use classical DM methods
• Use novel spatial data mining techniques

• Possible Approach:
• Define the problem: capture special needs
• Explore data using maps, other visualization
• Try reusing classical DM methods
• If classical DM methods perform poorly, try new methods
• Evaluate chosen methods rigorously
• Performance tuning as needed
Families of SDM Patterns

• Common families of spatial patterns


• Location Prediction: Where will a phenomenon occur ?
• Spatial Interaction: Which subsets of spatial phenomena interact?
• Hot spots: Which locations are unusual ?

• Note:
• Other families of spatial patterns may be defined
• SDM is a growing field, which should accommodate new pattern families

[Ref: Spatial Databases: A Tour, by Shashi Shekhar, Sanjay Chawla; Internet resources]
Location Prediction
• Question addressed
• Where will a phenomenon occur?
• Which spatial events are predictable?
• How can a spatial event be predicted from other spatial events?
• Equations, rules, other methods,

• Examples:
• Where will an endangered bird nest ?
• Which areas are prone to fire given maps of vegetation, drought, etc.?
• What should be recommended to a traveler in a given location?
Spatial Interactions - Examples
• Which spatial events are related to each other?
• Which spatial phenomena depend on other phenomena?

[Ref: Spatial Databases: A Tour, by Shashi Shekhar, Sanjay Chawla; Internet resources]
Hot spots
• Question addressed
• Is a phenomenon spatially clustered?
• Which spatial entities or clusters are unusual?
• Which spatial entities share common
characteristics?

• Examples:
• Cancer clusters [CDC] to launch investigations
• Crime hot spots to plan police patrols

• Defining unusual
• Comparison group:
• neighborhood
• entire population
• Significance: probability of being unusual is high
[Ref: Spatial Databases: A Tour, by Shashi Shekhar, Sanjay Chawla; Internet resources]

Spatial Analysis (4)


Categorizing Families of SDM Patterns
• Spatial Data Model concepts
• Entities: Categories of distinct, identifiable, relevant things
• Attribute: Properties, features, or characteristics of entities
• Instance of an entity - individual occurrence of entities
• Relationship: interactions or connection among entities, e.g. neighbor
• Degree - number of participating entities
• Cardinality - number of instances of an entity in an instance of a relationship
• Self-referencing - interaction among instances of a single entity
• Instance of a relationship - individual occurrence of relationships

• Pattern families (PF) in entity relationship models


• Relationships among entities, e.g. neighbor
• Value-based interactions among attributes
Families of SDM Patterns

• Common families of spatial patterns


• Location Prediction:
• The value of a special attribute of an entity is determined by the values of other attributes of the
same entity
• Spatial Interaction:
• N-ary interaction among subsets of entities
• N-ary interactions among categorical attributes of an entity
• Hot spots: self-referencing interaction among instances of an entity

• Note:
• Other families of spatial patterns may be defined
• SDM is a growing field, which should accommodate new pattern families

[Ref: Spatial Databases: A Tour, by Shashi Shekhar, Sanjay Chawla; Internet resources]
Unique Properties of Spatial Patterns
• Items in traditional data are independent of each other,
• whereas properties of locations in a map are often “auto-correlated”.
• Traditional data deals with simple domains, e.g. numbers and symbols,
• whereas spatial data types are complex
• Items in traditional data describe discrete objects
• whereas spatial data is continuous
• First law of geography [Tobler]:
• Everything is related to everything else, but nearby things are more related than distant things.
• People with similar backgrounds tend to live in the same area
• Economies of nearby regions tend to be similar
• Changes in temperature occur gradually over space(and time)

[Ref: Spatial Databases: A Tour, by Shashi Shekhar, Sanjay Chawla; Internet resources]
Mapping Techniques to Spatial Pattern Families

• Overview
• Several techniques to find a spatial pattern family
• Choice of technique depends on feature selection, spatial data, etc.
• Spatial pattern families vs. Techniques
• Location Prediction: Classification, function determination
• Interaction : Correlation, Association, Colocations
• Hot spots: Clustering, Outlier Detection
• Focus on
• Spatial problems
• Even though these techniques apply to non-spatial datasets too
Spatial Autocorrelation - Basics
• What is Spatial Autocorrelation?
• Why Spatial Autocorrelation is Important?
• How to Measure Spatial Autocorrelation?
• Examples
Spatial Autocorrelation
• Spatial Autocorrelation is a special property of geospatial data.
• It is the formal property that measures the degree to which near and distant things are related
• It is a statistical test of match between
locational similarity and attribute similarity
• It is a property that is often exhibited by variables
which are sampled over space
• It is based on Tobler’s 1st law of geography.

• Tobler’s 1st law of geography:
“All places are related but nearby places are more related than distant places”

• Examples:
• Temperature values of two locations near to each other will be similar.
Types of Spatial Autocorrelation
• Positive autocorrelation: neighboring areas are more alike
• Negative autocorrelation: neighboring areas are unlike
• Random (no autocorrelation): patterns exhibit no spatial autocorrelation
Importance of Spatial Autocorrelation
• Most statistics are based on the assumption that the values of observations in
each sample are independent of one another.
• If the samples were taken from nearby areas, then positive spatial
autocorrelation may violate this.

Goals:
• To measure the strength of spatial autocorrelation in a map
• To test the assumption of independence or randomness
• To explore whether there is any clustering pattern in the data or whether the data is just random
Measuring Spatial Autocorrelation

Steps in determining the extent of spatial autocorrelation:

• Step-1: Find out which areas are linked to one another
• Choose a neighborhood criterion

• Step-2: Assign weights to the areas that are linked
• Create a spatial weights matrix

• Step-3: Run a statistical test, using the weights matrix, to examine spatial autocorrelation
Neighborhood criteria
• Contiguity (common boundary)
• Distance (K-nearest neighbors, distance band)
• How many “neighbors” to include, what distance do we use?
Contiguity
• Adjacency - Sharing a border/boundary or point
• For regular polygons:
• Rook case (neighbors share an edge)
• Bishop case (neighbors share only a corner point)
• Queen case (neighbors share an edge or a corner point)
• For irregular polygons:
• All polygons that share a common border, or have a centroid within the circle defined by the
average distance to centroids of polygons that share a common border.


Spatial Weight Matrix
• Weights based on Contiguity
• If zone j is adjacent to zone i, the interaction receives a weight of 1, otherwise it
receives a weight of 0 and is essentially excluded

wij =
      A  B  C  D  E  F
  A   0  1  1  1  0  0
  B   1  0  1  0  1  0
  C   1  1  0  1  1  0
  D   1  0  1  0  1  1
  E   0  1  1  1  0  1
  F   0  0  0  1  1  0

• Weights based on Distance


• Uses a measure of the actual distance between points or between polygon centroids.
• Most common choices are:
• inverse (reciprocal): wij = 1/dij
• inverse of squared distance: wij = 1/dij^2
• negative exponential: e^(-d) or e^(-d^2)
• length of shared boundary: wij = length(i, j)/length(i)
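Both kinds of weights are simple to construct. The sketch below builds the binary contiguity matrix for the six zones A to F from an adjacency list read off the matrix above, and an inverse-distance matrix from point coordinates. The function names, the adjacency-list encoding, and the sample coordinates are assumptions for illustration; coincident points would cause a division by zero and are not handled.

```python
import math

def contiguity_weights(names, adjacency):
    """Binary contiguity matrix: w[i][j] = 1 if zones i and j are adjacent, else 0."""
    index = {name: i for i, name in enumerate(names)}
    w = [[0] * len(names) for _ in names]
    for a, neighbors in adjacency.items():
        for b in neighbors:
            w[index[a]][index[b]] = 1
            w[index[b]][index[a]] = 1  # adjacency is symmetric
    return w

def inverse_distance_weights(points, power=1):
    """w[i][j] = 1 / d_ij**power for i != j, and 0 on the diagonal."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] = 1.0 / math.dist(points[i], points[j]) ** power
    return w

# Each join is listed once; the function fills in both directions.
names = ["A", "B", "C", "D", "E", "F"]
adjacency = {"A": "BCD", "B": "CE", "C": "DE", "D": "EF", "E": "F"}
W = contiguity_weights(names, adjacency)
for name, row in zip(names, W):
    print(name, row)

# Hypothetical centroids 5 units apart: 1/d = 0.2, 1/d^2 = 0.04
D1 = inverse_distance_weights([(0, 0), (3, 4)])
D2 = inverse_distance_weights([(0, 0), (3, 4)], power=2)
```

The sum of all entries of `W` is twice the number of joins, because each adjacent pair is counted once in each direction.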

Spatial Analysis (5)


Statistical Tests to Examine Spatial Autocorrelation
Statistical Tests for presence of spatial autocorrelation
• Global Tests
• Moran’s I
• Geary’s C
• Local Tests
(LISA – Local Indicators of Spatial Autocorrelation)
• Local Moran’s I

• Other, simpler tests:


• The Chi‐square Test
• The Join Count Statistic
Global Moran’s I
\[
I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\left( \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \right) \sum_{i=1}^{n} (x_i - \bar{x})^2}
\]

• Double sum in the numerator: product of the deviation from the mean for all pairs of adjacent regions (wij = 1)
• \sum_i \sum_j w_{ij}: sum of the weights (count of all adjacent pairs)
• \sum_i (x_i - \bar{x})^2: a measure of variance across the regions

where,
n : the number of regions
\bar{x} : the mean of the variable
xi : the variable value at a particular location i
wij : a weight indexing location of i relative to j

• Moran’s I typically ranges from -1 to 1
• Indices close to zero indicate a random pattern
• Indices toward +1 indicate a tendency toward clustering
• Indices toward -1 indicate a tendency toward dispersion/uniformity
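A direct transcription of the Global Moran’s I formula into code is short. The sketch below assumes a dense weights matrix given as a list of lists and uses a hypothetical chain of four regions (each region adjacent to the next) with made-up values to show the sign behavior described above; the function name `morans_i` is an assumption.

```python
def morans_i(values, w):
    """Global Moran's I for `values` under the spatial weights matrix `w`."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    total_weight = sum(sum(row) for row in w)
    # Weighted cross-products of deviations over all ordered pairs (i, j)
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    squared_dev = sum(d * d for d in dev)
    return (n / total_weight) * (cross / squared_dev)

# Hypothetical chain of four regions: 1-2, 2-3, 3-4 are adjacent
chain_w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 2, 3, 4], chain_w)    # similar values sit next to each other
alternating = morans_i([1, 4, 1, 4], chain_w)  # neighbors are as unlike as possible

print("clustered  :", round(clustered, 4))
print("alternating:", round(alternating, 4))
```

The smoothly increasing values give a positive index (1/3 here), while the alternating values give -1, the dispersed extreme. Judging whether an observed value differs significantly from randomness still requires a significance test.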
Local Moran’s I - (LISA: local Indicators of Spatial Autocorrelation)

• Location‐specific statistics
• Used to determine if local autocorrelation exists around each region i
• Clusters/hot‐spots
• Heterogeneity

\[
I_i = \frac{n (x_i - \bar{x})}{\sum_{j=1}^{n} (x_j - \bar{x})^2} \sum_{j=1}^{n} w_{ij} (x_j - \bar{x})
\]

where i is the point for which we calculate I_i, and the neighborhood is specified by the weights matrix wij
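The local statistic can be sketched in the same way, assuming the scaling in which each I_i is the deviation at i, divided by the mean squared deviation, times the weighted sum of the neighbors’ deviations. Under that assumption the local values add up to the sum of the weights times the global I, which the example checks. The chain of four regions and its values are hypothetical, and `local_morans_i` is an illustrative name.

```python
def local_morans_i(values, w):
    """Local Moran's I_i for every region i under the weights matrix `w`."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    m2 = sum(d * d for d in dev) / n  # mean squared deviation
    return [
        dev[i] / m2 * sum(w[i][j] * dev[j] for j in range(n))
        for i in range(n)
    ]

# Hypothetical chain of four regions: 1-2, 2-3, 3-4 are adjacent
chain_w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
local = local_morans_i([1, 2, 3, 4], chain_w)
total_weight = sum(sum(row) for row in chain_w)

print("local I_i:", [round(v, 3) for v in local])
print("sum(I_i) / W:", round(sum(local) / total_weight, 4))
```

Positive I_i marks a region whose neighbors deviate from the mean in the same direction (part of a cluster); negative I_i marks a region that differs from its neighbors.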
Join Count Statistics Method
• For binary (1,0) categorical data only
• Shown here as B/W (black/white)
• Requires a contiguity matrix for polygons
• Based upon the proportion of “joins” between categories, e.g.
• Total of 60 for Rook Case
• Total of 110 for Queen Case
• A join, or edge, is classified as either WW (0-0), BB (1-1), or BW (1-0).

• Positive autocorrelation: small proportion (or count) of BW joins and large proportion of BB and WW joins
• No autocorrelation: dissimilar proportions (or counts) of BW, BB and WW joins
• Negative autocorrelation: large proportion (or count) of BW joins and small proportion of BB and WW joins
Join Count Statistic: Calculation

• Test statistic is given by: Z = (Observed - Expected) / (SD of Expected)

Expected value is given by: E_BW = 2 k pB pW

Standard Deviation (SD) of Expected is given by: \sqrt{2 (k + m) pB pW - 4 (k + 2m) pB^2 pW^2}

Where: k is the total number of joins (neighbors)
pB is the expected proportion Black
pW is the expected proportion White
m is calculated from k according to:
Example: Moran’s I

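[Figure: five areal units A, B, C, D, E]

\[
I_{\text{(Global)}} = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\left( \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \right) \sum_{i=1}^{n} (x_i - \bar{x})^2} = -0.2806
\]

The I value is less than 0. Therefore, the areal pattern may be dispersed. A Z-test is required.

For i = C,
\[
I_{i\,\text{(Local)}} = \frac{n (x_i - \bar{x})}{\sum_{j=1}^{n} (x_j - \bar{x})^2} \sum_{j=1}^{n} w_{ij} (x_j - \bar{x}) = 0
\]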
Example: Join Count Statistic

[Figure: five areal units A, B, C, D, E]

Given p+ = 3/5 and p- = 2/5 (with k = 7 joins), therefore
E(+-) = 2 × 7 × 3/5 × 2/5 = 3.36
m = 28
V(+-) = 2 × (7 + 28) × 3/5 × 2/5 - 4 × (7 + (2 × 28)) × (3/5)^2 × (2/5)^2 = 2.28
Example: Join Count Statistic (contd.)

O : Observed Value
E : Expected Value
V : Variance

Z = (O - E) / \sqrt{V} = (5 - 3.36) / \sqrt{2.28} = 1.086

Therefore, we don’t reject H0 : “the areal pattern is random”
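The arithmetic of this example can be replayed in a few lines. The sketch takes the inputs exactly as given in the example (5 observed dissimilar joins, k = 7, m = 28, proportions 3/5 and 2/5) and uses the expected value and variance expressions as they appear in the worked numbers; the function name and the 1.96 cut-off (two-sided test at the 5% level) are assumptions added for illustration.

```python
import math

def join_count_z(observed_bw, k, m, p_b, p_w):
    """Return (expected, variance, z) for the count of dissimilar (BW) joins."""
    expected = 2 * k * p_b * p_w
    variance = 2 * (k + m) * p_b * p_w - 4 * (k + 2 * m) * p_b**2 * p_w**2
    z = (observed_bw - expected) / math.sqrt(variance)
    return expected, variance, z

expected, variance, z = join_count_z(observed_bw=5, k=7, m=28, p_b=3 / 5, p_w=2 / 5)

print("expected:", round(expected, 2))
print("variance:", round(variance, 2))
print("z       :", round(z, 3))

# Assumed decision rule: two-sided test at the 5% significance level
reject_h0 = abs(z) > 1.96
print("reject H0 (random pattern)?", reject_h0)
```

Without rounding the variance to 2.28 first, Z comes out marginally smaller than the 1.086 quoted above; the conclusion is unchanged.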


Location Prediction
• Classical method:
• Logistic regression, decision trees, Bayesian classifier
• Assumes learning samples are independent of each other
• Spatial auto-correlation violates this assumption!
• What would a map look like if the properties of a pixel were independent of the properties of other pixels?

• Spatial methods
• Spatial auto-regression (SAR),
• Markov random field
• Bayesian classifier
Association Rule Mining

• Classical method:
• Association rule given item-types and transactions
• Assumes spatial data can be decomposed into transactions
• However, such decomposition may alter spatial patterns
• Spatial methods
• Spatial association rules
• Spatial co-locations

• Note: Association rule or co-location rules are fast filters to reduce the number of pairs for
rigorous statistical analysis, e.g correlation analysis, cross-K-function for spatial interaction etc.
Associations, Spatial Associations, Co-location

[Figure: sample dataset with the question “Patterns in the dataset?”]
Association Rules Discovery

• An association rule has three parts
• Rule: X → Y, or antecedent (X) implies consequent (Y)
• Support = the number of times a rule shows up in a database
• Confidence = conditional probability of Y given X

• Examples (Spatial)
• (bedrock type = limestone), (soil depth < 50 feet) => (sink hole risk = high)
• Support = 20 percent, Confidence = 0.8
• Interpretation: Locations with limestone bedrock and low soil depth have high risk of
sink hole formation.
Association Rules: Formal Definitions

• Consider a set of items, I = \{i_1, ..., i_k\}

• Consider a set of transactions T = \{t_1, ..., t_n\}
• where each t_i is a subset of I.

• Support of C: \sigma(C) = |\{t \mid t \in T, C \subseteq t\}|

• Then i_1 \to i_2 iff
• Support: i_1 \cup i_2 occurs in at least s percent of the transactions: \sigma(i_1 \cup i_2) / |T| \geq s
• Confidence: at least c percent: \sigma(i_1 \cup i_2) / \sigma(i_1) \geq c
Spatial Association Rules
• Spatial Association Rules
• A special reference spatial feature
• Transactions are defined around instance of special spatial feature
• Item-types = spatial predicates
Co-location Rules
• Motivation
• Association rules need transactions (subsets of instance of item-types)
• Spatial data is continuous
• Decomposing spatial data into transactions may alter patterns

• Co-location Rules
• For point data in space
• Does not need transaction, works directly with continuous space
• Use neighborhood definition and spatial joins
• “Natural approach”
Colocation Rules
Co-location rules vs. Association rules

                                 Association rules        Co-location rules
Underlying space                 discrete sets            continuous space
Item-types                       item-types               events / Boolean spatial features
Collection                       Transaction (T)          Neighborhood (N)
Prevalence measure               support                  participation index
Conditional probability metric   Pr.[ A in T | B in T ]   Pr.[ A in N(L) | B at location L ]

Participation index = min{pr(fi, c)}

where pr(fi, c), for feature fi in co-location c = {f1, f2, …, fk}, is the
fraction of instances of fi with features {f1, …, fi-1, fi+1, …, fk} nearby, and
N(L) = neighborhood of location L
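For a co-location of just two features, the participation index reduces to a pair of nearest-neighbor checks. The sketch below covers only that two-feature case, assumes a Euclidean distance threshold as the neighborhood definition, and uses hypothetical point data; larger co-locations would require enumerating instances in which all features are mutually nearby, which is not attempted here.

```python
import math

def participation_index(points, feature_a, feature_b, radius):
    """Participation index of the pair co-location {feature_a, feature_b}.

    points: list of (feature, x, y) tuples.
    """
    def ratio(f, g):
        # Fraction of instances of f that have at least one instance of g within `radius`
        f_pts = [(x, y) for feat, x, y in points if feat == f]
        g_pts = [(x, y) for feat, x, y in points if feat == g]
        near = sum(1 for p in f_pts if any(math.dist(p, q) <= radius for q in g_pts))
        return near / len(f_pts)

    return min(ratio(feature_a, feature_b), ratio(feature_b, feature_a))

# Hypothetical instances of two Boolean spatial features A and B
points = [
    ("A", 0.0, 0.0), ("A", 10.0, 10.0),
    ("B", 0.0, 1.0), ("B", 0.5, 0.5),
]
pi_ab = participation_index(points, "A", "B", radius=2.0)
print("participation index of {A, B}:", pi_ab)
```

Here only one of the two A instances has a B nearby (ratio 0.5) while both B instances have an A nearby (ratio 1.0), so the index is the minimum, 0.5.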
Clustering
• Clustering
• process of discovering groups in large databases.
• Spatial view: rows in a database = points in a multi-dimensional space
• Visualization may reveal interesting groups
• A diverse family of techniques based on available group descriptions
• Example: Census
• Attribute based groups
• Homogeneous groups, e.g. urban core, suburbs, rural
• Central places or major population centers
• Hierarchical groups: NE corridor, Metropolitan area, major cities, neighborhoods
• Areas with unusually high population growth/decline
• Purpose based groups, e.g. segment population by consumer behaviour
• Data driven grouping with little a priori description of groups
• Many different ways of grouping using age, income, spending, ethnicity, ...
Spatial Clustering Example

• Population density
• Grouping Goal - central places
• identify locations that dominate surroundings
• Grouping goal - homogeneous areas
Techniques for Clustering
• Categorizing classical methods:
• Hierarchical methods
• Partitioning methods, e.g. K-means, K-medoid
• Density based methods
• Grid based methods
• Spatial methods
• Comparison with complete spatial random processes
• Neighborhood EM
• Focus:
• Partitioning methods and new spatial methods
• Outlier detection has methods similar to density based methods
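As an illustration of a partitioning method, the following is a minimal K-means sketch. It assumes Euclidean distance and takes the initial centroids as an argument so that the run is deterministic; practical implementations usually choose them randomly and restart several times. The point coordinates are hypothetical.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Plain K-means: alternate assignment and centroid update until stable."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, centroids=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```

Note that this treats locations as ordinary numeric attributes; it does not account for spatial autocorrelation, which is what motivates the spatial methods listed above.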
Outliers
• What is an outlier?
• Observations inconsistent with rest of the dataset
• Techniques for global outliers
• Statistical tests based on membership in a distribution
• Pr.[item in population] is low
• Non-statistical tests based on distance, nearest neighbors, convex hull, etc.
• Spatial outliers?
• Observations inconsistent with their neighborhoods
• A local instability or discontinuity
• Techniques for spatial outliers
• Graphical - Variogram cloud, Moran scatterplot
• Algebraic - Scatterplot, Z(S(x))
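One common formulation of the Z(S(x)) test, assumed here, takes S(x) as the difference between the attribute value at x and the average value over the neighbors of x, standardizes S over all locations, and flags a location when the absolute standardized value exceeds a threshold. The sketch uses hypothetical readings along a line of nine locations, a threshold of 2, and illustrative names throughout.

```python
import statistics

def spatial_outliers(values, neighbors, theta=2.0):
    """Indices i whose standardized neighborhood difference exceeds `theta`.

    S(i) = values[i] - mean of values over neighbors[i]
    Z(i) = |(S(i) - mean(S)) / stdev(S)|
    """
    s = [
        values[i] - statistics.mean([values[j] for j in neighbors[i]])
        for i in range(len(values))
    ]
    mu = statistics.mean(s)
    sigma = statistics.pstdev(s)  # assumes S is not constant (sigma > 0)
    return [i for i, si in enumerate(s) if abs((si - mu) / sigma) > theta]

# Hypothetical readings along a line; each location neighbors the adjacent ones
values = [10, 11, 10, 50, 11, 10, 11, 10, 11]
neighbors = {
    i: [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
    for i in range(len(values))
}
outliers = spatial_outliers(values, neighbors)
print("spatial outliers at indices:", outliers)
```

Only the location with the reading 50 is flagged. Its two neighbors also differ strongly from their own neighborhoods, because 50 pulls their neighborhood averages up, but their standardized differences stay below the threshold.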
Summary
• Patterns are opposite of random
• Common spatial patterns: location prediction, feature interaction, hot spots
• SDM = search for unexpected interesting patterns in large spatial databases
• Spatial patterns may be discovered using
• Techniques like classification, associations, clustering and outlier detection
• New techniques are needed for SDM due to
• Spatial Auto-correlation
• Continuity of space
