
Sulistyo Puspitodjati

Spatial Data Mining

Source:
Yang Yubin
Joint Laboratory for Geoinformation Science
The Chinese University of Hong Kong
yangyubin@cuhk.edu.hk
Agenda
• Motivation and General Description
• Data Mining: Basic Concepts
• Data Mining Techniques
• Spatial Data Mining
• Spatial Data Mining Scenarios in Meteorology
and Weather Forecasting
• Conclusions
• Questions & Discussions

2
Why do we need Data Mining?
• Large number of records (cases) (10^8 to 10^12 bytes)
– One thousand (10^3) bytes = 1 kilobyte (KB)
– One million (10^6) bytes = 1 megabyte (MB)
– One billion (10^9) bytes = 1 gigabyte (GB)
– One trillion (10^12) bytes = 1 terabyte (TB)
• High dimensional data (variables)
– 10 to 10^4 attributes
• Only a small portion, typically 5% to 10%, of
the collected data is ever analyzed
• We are drowning in data, but starving for
knowledge!
4
Scientific Viewpoint
• Data collected and stored at enormous speeds
(Gbyte/hour)
– remote sensor on a satellite
– telescope scanning the skies
– scientific simulations generating terabytes of data
• Classical modeling techniques are infeasible
• Data reduction
• Cataloging, classifying, segmenting data
• Helps scientists in Hypothesis Formation
5
Current Situations (1)
• Great efforts for construction and maintenance
of large information databases
• Data cannot be analyzed by standard statistical
methods
– numerous missing records
– data are qualitative rather than quantitative
• We do not always know what information
might be represented or how relevant it might
be to the questions
6
Current Situations (2)
• The ways and means of using all this data lag
far behind the increase in available data
– Information is often:
• found only by coincidence (internet)
• not explicitly available (company databases)
• accessible to human eyes only with lots of
processing power (astronomical, meteorological and
earth observation data)
• This leads to a clear demand for means of
uncovering the information and knowledge
hidden in the massive quantities of data
7
What is Data Mining?
• Data mining is concerned with solving
problems by analyzing existing data
• “Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from huge amount of
data”
• Alternative name: Knowledge Discovery in
Databases (KDD)
– A term that originated in the Artificial Intelligence (AI) field
– KDD consists of several steps (one of which is Data Mining)

9
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD):
The whole process of finding useful
information and patterns in data
• Data Mining: Use of algorithms to extract
the information and patterns derived by the
KDD process
• Data mining is the core of the knowledge
discovery process

10
KDD Process

• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to a common format;
transform to a new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to the user in a
meaningful manner.
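The five KDD steps can be sketched as a chain of small functions. This is only an illustration: the record layout (`station`, `temp`, `unit`), the two sources, and the per-station mean used as the "mining" step are all assumptions made for the example, not part of the original slides.

```python
from statistics import mean

def select(sources):
    """Selection: gather raw records from several sources."""
    return [rec for src in sources for rec in src]

def preprocess(records):
    """Preprocessing: drop records with a missing temperature."""
    return [r for r in records if r.get("temp") is not None]

def transform(records):
    """Transformation: convert every reading to a common unit (Celsius)."""
    out = []
    for r in records:
        temp = r["temp"]
        if r.get("unit") == "F":
            temp = (temp - 32) * 5 / 9
        out.append({"station": r["station"], "temp_c": round(temp, 1)})
    return out

def mine(records):
    """Data mining: a per-station mean stands in for a real mining algorithm."""
    by_station = {}
    for r in records:
        by_station.setdefault(r["station"], []).append(r["temp_c"])
    return {s: round(mean(v), 1) for s, v in by_station.items()}

def interpret(patterns):
    """Interpretation/evaluation: present the results in readable form."""
    return [f"{s}: mean {t} C" for s, t in sorted(patterns.items())]

# Hypothetical input: two sources with different units and one missing value.
source_a = [{"station": "A", "temp": 20.0, "unit": "C"},
            {"station": "A", "temp": None, "unit": "C"}]
source_b = [{"station": "A", "temp": 71.6, "unit": "F"},
            {"station": "B", "temp": 50.0, "unit": "F"}]

report = interpret(mine(transform(preprocess(select([source_a, source_b])))))
```

Keeping each step a separate function mirrors the slide: data mining is one stage, and the stages around it decide what data it sees and how its output is presented.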
11
Data Mining: A KDD Process
– Data mining: core of the knowledge discovery process
[Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
12
Typical Data Mining Architecture
[Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration, Filtering; Databases and Data Warehouse. A Knowledge-base sits alongside the upper layers.]

13
Data Mining: Confluence of
Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Systems, Statistics, Machine Learning, Visualization, Information Theory, Algorithms, …, and Other Disciplines]

14
Data Mining is:
• A “hot” word for a class of techniques that find
patterns in data
• A user-centric, interactive process which leverages
analysis technologies and computing power
• A group of techniques that find relationships that
have not previously been discovered
• Not reliant on an existing database
• A relatively easy task that requires knowledge of
the business problem/subject matter expertise

15
Experts and clients are needed to:
• Define and redefine problems
• Determine relevant aspects of the problem
• Supply the data
• Remove errors from the data
• Provide constraints on possible patterns
• Interpret patterns and possibly reject
implausible ones
• Evaluate predicted effects…
16
Primary Data Mining Tasks (1)
• Descriptive Modeling
– Finding a compact description for a large dataset
[Concept Description]
– Clustering people or things into groups based on
their attributes [Clustering]
– Associating what events are likely to occur together
[Association Rule]
– Sequencing what events are likely to lead to later
events [Sequential Pattern Analysis]
– Discovering the most significant changes
[Deviation Detection]
18
Primary Data Mining Tasks (2)
• Predictive Modeling
– Classifying people or things into groups by
recognizing patterns [Classification]
– Forecasting what may happen in the future by
mapping a data item to a real-valued prediction
variable [Regression]

19
Concept Description
• Characterization: provides a concise and
succinct summarization of the given
collection of data
• Discrimination: provides descriptions
comparing two or more collections of data
• can handle complex data types of the
attributes
• a more automated process

20
Concept description: Characterization
Initial Relation
Name            Gender    Major          Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M         CS             Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M         CS             Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F         Physics        Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …         …              …                      …           …                         …         …

Generalization applied to each attribute
Removed         Retained  Sci, Eng, Bus  Country                Age range   City                      Removed   Excl, VG, ..

Generalized Relation
Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …

Cross-tabulation of Gender by Birth_Region
Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62

21
Clustering
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Clustering
– Grouping a set of data objects into clusters based on the
principle: maximizing the intra-class similarity and
minimizing the interclass similarity
• Example
– Land use: Identification of areas of similar land use in
an earth observation database
– City-planning: Identifying groups of houses according
to their house type, value, and geographical location
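A minimal sketch of the clustering principle, using k-means as one possible technique (the slide above does not prescribe an algorithm). The house coordinates and the two starting centroids are made-up values for illustration.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ended up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical house locations (x, y) forming two obvious groups.
houses = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(houses, centroids=[(1, 1), (8, 8)])
```

Points within a cluster end up close to their own centroid and far from the other one, which is the similarity principle stated above expressed through distance.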
22
Association rule
• Association (correlation and causality)
– age(X, “20..29”) ^ income(X, “20..29K”) => buys(X,
“PC”) [support = 2%, confidence = 60%]
• Association rule mining
– Finding frequent patterns, associations, correlations
among sets of items or objects in transaction databases,
relational databases, and other information repositories
– Frequent pattern: pattern (set of items, sequence, etc.)
that occurs frequently in a database
• Motivation: finding regularities in data
– What products were often purchased together?
23
Example: Association rule

Transaction-id  Items bought
10              a1, a2, a3
20              a1, a3
30              a1, a4
40              a2, a5, a6

• Itemset A1, A2 = {a1, …, ak}
• Find all the rules A1 => A2 with min
confidence and support
– support, s, probability that a
transaction contains A1 ∪ A2
– confidence, c, conditional
probability that a transaction
having A1 also contains A2.

Let min_support = 50%, min_conf = 50%:
a1 => a3 (50%, 66.7%)
a3 => a1 (50%, 100%)
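The support and confidence figures for this example can be checked with a few lines of Python. This is a sketch that only enumerates rules with a single item on each side; the function names are chosen for illustration.

```python
from itertools import permutations

# The four transactions from the example table.
transactions = {
    10: {"a1", "a2", "a3"},
    20: {"a1", "a3"},
    30: {"a1", "a4"},
    40: {"a2", "a5", "a6"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction with `lhs` also has `rhs`."""
    return support(lhs | rhs) / support(lhs)

def rules(min_support=0.5, min_conf=0.5):
    """All single-item rules a => b meeting both thresholds."""
    items = sorted(set().union(*transactions.values()))
    found = []
    for a, b in permutations(items, 2):
        s = support({a, b})
        if s >= min_support and confidence({a}, {b}) >= min_conf:
            found.append((a, b, s, round(confidence({a}, {b}), 3)))
    return found

result = rules()
```

With both thresholds at 50%, `result` holds exactly the two rules listed in the example: a1 => a3 and a3 => a1.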
24
Sequential Pattern Analysis
• Given a set of sequences, find the complete set of
frequent subsequences
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
• Given support threshold min_sup = 2, <(ab)c> is a
sequential pattern
• Applications of sequential pattern
– Customer shopping sequences:
• First buy computer, then CD-ROM, and then digital camera, within
3 months.
– Weblog click streams
– Telephone calling patterns
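To see why <(ab)c> qualifies as a sequential pattern, the support count can be computed directly. In this sketch each sequence from the table is written as a list of sets (one set per element); the helper names are illustrative.

```python
# The sequence database from the table, one set per sequence element.
db = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}

def contains(sequence, pattern):
    """True if the pattern's elements occur in order, each one as a
    subset of a distinct element of the sequence."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)

def support(pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in db.values())

ab_then_c = support([{"a", "b"}, {"c"}])
```

The pattern is contained in sequences 10 and 30 only, so its support is 2, which meets min_sup = 2.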

25
Deviation Detection
• Outlier analysis
– Outlier: a data object that does not comply with
the general behavior of the data
– It can be considered as noise or exception but is
quite useful in fraud detection, rare events
analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Periodicity analysis
– Similarity-based analysis
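One simple way to flag an outlier is a z-score test: a value is suspicious if it lies more than a chosen number of standard deviations from the mean. The slide does not name a method, so this is just one common choice; the readings and the threshold of 2.0 are assumptions for the example.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical readings with one value far from the rest.
readings = [10, 11, 9, 10, 12, 10, 11, 30]
outliers = zscore_outliers(readings)
```

Whether the flagged value is noise or a rare event worth studying is a judgment the analyst still has to make.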

26
Classification and Regression
• Classification:
– constructs a model (classifier) based on the
training set and uses it in classifying new data
– Example: Climate Classification,…
• Regression:
– models continuous-valued functions, i.e.,
predicts unknown or missing values
– Example: stock trends prediction,…

27
Classification (1): Model Construction
[Figure: Training Data → Classification Algorithms → Classifier (Model)]

Training Data
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model)
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’

28
Classification (2): Prediction Using the Model
[Figure: Testing Data and Unseen Data → Classifier → Tenured?]

Testing Data
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data
(Jeff, Professor, 4) → Tenured?
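Applying the rule from the model-construction slide to the testing data shows how a classifier is evaluated before it is used on unseen cases. The sketch below encodes that rule as a function; the function and variable names are illustrative.

```python
def predict_tenured(rank, years):
    """The learned rule: IF rank = 'professor' OR years > 6 THEN 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data: (name, rank, years, actual label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in testing_data)
accuracy = correct / len(testing_data)

# Unseen case from the slide: (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)
```

The rule misclassifies Merlisa (7 years, not tenured), giving 3 of 4 correct on the testing data, and it predicts "yes" for Jeff.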

29
Classification Techniques
• Decision Tree Induction
• Bayesian Classification
• Neural Networks
• Genetic Algorithms
• Fuzzy Set and Logic

30
Regression
• Regression is similar to classification
– First, construct a model
– Second, use model to predict unknown
value
• Methods
– Linear and multiple regression
– Non-linear regression
• Regression is different from
classification
– Classification predicts a categorical
class label
– Regression models continuous-valued
functions
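The two-step pattern (construct a model, then predict) looks like this for simple linear regression fitted by ordinary least squares. The data points are invented so that the fit is exact.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    """Use the fitted (slope, intercept) pair to predict a continuous value."""
    slope, intercept = model
    return slope * x + intercept

# Step 1: construct the model from hypothetical observations of y = 2x + 1.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Step 2: predict an unknown value.
y_at_5 = predict(model, 5)
```

Unlike the tenure classifier, the output here is a number on a continuous scale rather than a class label.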
31
Are All the “Discovered” Patterns
Interesting?
• A data mining task may generate thousands of
patterns; not all of them are interesting.
• Interestingness measures:
– A pattern is interesting if it is easily understood by
humans, valid on new or test data with some degree of
certainty, potentially useful, novel, or validates some
hypothesis that a user seeks to confirm
– Objective vs. Subjective interestingness measures:
• Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
• Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, executability, etc.
32
Spatial Data Mining
• Spatial Patterns
– Spatial outliers
– Location prediction
– Associations, co-locations
– Hotspots, Clustering, trends, …
• Primary Tasks
– Mining Spatial Association Rules
– Spatial Classification and Prediction
– Spatial Data Clustering Analysis
– Spatial Outlier Analysis
• Example: Unusual warming of Pacific ocean (El
Nino) affects weather in USA…
34
Spatial Data Mining Results
• Understanding spatial data, discovering
relationships between spatial and nonspatial data,
construction of spatial knowledge bases, etc.
• In various forms
– The description of the general weather patterns in a set
of geographic regions is a spatial characteristic rule.
– The comparison of two weather patterns in two
geographic regions is a spatial discriminant rule.
– A rule like “most cities in Canada are close to the
Canada-US border” is a spatial association rule
• near(x, coast) ^ southeast(x, USA) => hurricane(x) (70%)
– Others: spatial clusters,…

35
What is Spatial Data?
• The data related to objects
that occupy space
– traffic, bird habitats, global
climate, logistics, ...
• Object types:
– Points, Lines, Polygons,etc.

Used in/for:

GIS - Geographic Information Systems

Meteorology

Astronomy

Environmental studies, etc.

36
Basic Concepts (1)
• Spatial data mining follows along the same functions
in data mining, with the end objective to find patterns
in geography, meteorology, etc.
• The main difference (Spatial autocorrelation)
– the neighbors of a spatial object may have an influence on
it and therefore have to be considered as well
• Spatial attributes
– Topological
• adjacency or inclusion information
– Geometric
• position (longitude/latitude), area, perimeter, boundary polygon
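Spatial autocorrelation can be quantified with Moran's I, a standard statistic that is positive when neighboring cells hold similar values and negative when they alternate. The slides do not name this measure, so treat it as one possible illustration. The sketch assumes a regular grid with rook adjacency (up, down, left, right) and uses made-up grids.

```python
def morans_i(grid):
    """Moran's I for a 2D grid with binary rook-adjacency weights."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    cross = 0.0      # sum over neighbor pairs of deviation products
    weight_sum = 0   # number of (directed) neighbor pairs
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    cross += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weight_sum += 1
    variance_sum = sum((v - mean) ** 2 for v in values)
    return (n / weight_sum) * (cross / variance_sum)

# Similar values sit next to each other: positive autocorrelation.
clustered = [[1, 1, 0, 0],
             [1, 1, 0, 0],
             [1, 1, 0, 0],
             [1, 1, 0, 0]]

# Every neighbor differs: negative autocorrelation.
checkerboard = [[1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1]]

i_clustered = morans_i(clustered)
i_checker = morans_i(checkerboard)
```

A clearly non-zero value signals that observations are not independent of their neighbors, which is why spatial mining methods take the neighborhood into account.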

37
Basic Concepts (2)
• Spatial neighborhood
– Topological relation
• “intersect”, “overlap”, “disjoint”, …
– Distance relation
• “close_to”, “far_away”, …
– Direction/orientation relation
• “left_of”, “west_of”, …
• Global model might be
inconsistent with regional
models
[Figure: Global Model vs. Local Model]
38
Applications
• NASA Earth Observing System (EOS):
Earth science data
• National Inst. of Justice: crime mapping
• Census Bureau, Dept. of Commerce: census
data
• Dept. of Transportation (DOT): traffic data
• National Inst. of Health (NIH): cancer
clusters
• ……

39
Example: What Kind of Houses Are Highly
Valued?—Associative Classification

40
Meteorological Data Mining
• Motivation
– Many analysis methods must be applied to fast-growing
data for climate studies
• Result
– Appropriate presentation instruments (graphs, maps,
reports, etc) must be applied
• Examples
– Spatial outliers can be associated with disastrous natural
events such as tornadoes, hurricanes, and forest fires
– Associations between disaster events and certain
meteorological observations
42
Case Studies (1): Astronomy
• SKICAT (SKy Image Cataloging and
Analysis Tool) (Caltech, US)
• The Palomar Observatory discovered
22 quasars with the help of data
mining
• The Second Palomar Observatory Sky
Survey (POSS-II)
– decision tree methods
– classification of galaxies, stars and other
stellar objects
• About 3 TB of sky images were
analyzed
43
Case Studies (2): NCAR & UCAR
• National Center for Atmospheric Research (NCAR) &
University Corporation for Atmospheric Research (UCAR), US
– http://www.ucar.edu/
• “Automatic Fuzzy Logic-based systems now compete
with human forecasts”
• Richard Wagoner, Deputy Director at Research Applications
Program (RAP), NCAR
• Intelligent Weather System (IWS)
– Detection and forecast in the areas of en-route turbulence,
en-route icing, ceiling/visibility, and convective hazards in
the aviation community
– Road winter maintenance, airport operations, and flash flood
forecasting
44
Operational Application
• Prediction System: WIND-2
– WIND: “Weather Is Not Discrete”
• Consists of three parts:
– Data
• Past airport weather observations, 30 years of hourly
observations, time series of 300,000 detailed observations
• Recent and current observations (METARs)
• Model-based guidance (knowledge of near-term changes, e.g.,
imminent wind-shift, onset/cessation of precipitation)
– Fuzzy similarity-measuring algorithm
– Prediction composition: predictions based on k nearest
neighbors (k-NN, an instance-based method)
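The combination of fuzzy similarity and k nearest neighbors can be illustrated generically. The sketch below is not WIND-2's actual algorithm: the triangular membership function, the attribute names, the tolerances, and the tiny archive are all assumptions made for the example.

```python
def fuzzy_similarity(a, b, tolerance):
    """Triangular membership: 1.0 when equal, falling linearly to 0.0
    once the difference reaches `tolerance`."""
    return max(0.0, 1.0 - abs(a - b) / tolerance)

def case_similarity(current, past, tolerances):
    """Overall similarity is the weakest attribute similarity (fuzzy AND)."""
    return min(fuzzy_similarity(current[k], past[k], tolerances[k])
               for k in tolerances)

def knn_forecast(current, archive, tolerances, k=2):
    """Similarity-weighted average of the outcomes of the k most
    similar past cases; None if no past case is similar at all."""
    ranked = sorted(archive,
                    key=lambda case: case_similarity(current, case, tolerances),
                    reverse=True)
    nearest = ranked[:k]
    weights = [case_similarity(current, c, tolerances) for c in nearest]
    total = sum(weights)
    if total == 0:
        return None
    return sum(w * c["next_visibility_km"]
               for w, c in zip(weights, nearest)) / total

# Hypothetical archive of past observations and what followed them.
archive = [
    {"temp_c": 10.0, "wind_kt": 5.0, "next_visibility_km": 8.0},
    {"temp_c": 12.5, "wind_kt": 5.0, "next_visibility_km": 4.0},
    {"temp_c": 20.0, "wind_kt": 30.0, "next_visibility_km": 1.0},
]
tolerances = {"temp_c": 5.0, "wind_kt": 10.0}
current = {"temp_c": 10.0, "wind_kt": 5.0}

forecast = knn_forecast(current, archive, tolerances, k=2)
```

Here the two most similar past cases have similarities 1.0 and 0.5, so the forecast is their weighted average; the dissimilar third case is ignored.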
45
Operational Application
• Hybrid methods are used to predict weather
– Dynamical approach - based upon equations of
the atmosphere, uses finite element techniques
– Empirical approach - similar weather situations
lead to similar outcomes
• WIND runs in real-time for
meteorologically different sites
• Data-mining/forecast process
takes about one second
46
47
Case Studies (3): CrossGrid (EU)
• Objective
– To develop, implement and exploit new Grid components
for interactive compute- and data-intensive applications, such as
flooding crisis team decision support systems and air pollution
combined with weather forecasting
• Main tasks in Meteorological applications package
– Data mining for atmospheric circulation patterns
• Find a set of representative prototypes of the atmospheric patterns
in a region of interest
– Weather forecasting for maritime applications
– Ocean wave forecasting by models of various complexity

48
• Data
– ERA-15 using a T106L31 model (from 1978 to 1994) with 1.125° resolution
– Terabytes
– Comprises data from approx. 20 variables (such as temperature, humidity,
pressure, etc.) at 30 pressure levels of a 360x360 nodes grid
[Figure: SOM Application for Data Mining: Adaptive Competitive Learning]
[Figure: Downscaling Weather Forecasts: sub-grid details escape from numerical models]

49
Dept. of Applied Mathematics, Universidad de Cantabria, Santander, Spain

50
Case Studies (4): Typhoon Image
Data Mining
• Objective
– To establish algorithms and database models for the
discovery of information and knowledge useful for typhoon
analysis and prediction
– Content-based image retrieval technology to search for
similar cloud patterns in the past
– Data mining technology to extract spatio-temporal pattern
information that is meaningful from a meteorological
viewpoint
• Result
– Alignment of Multiple Typhoons, Explore by Projection to
2D Plane, Diurnal Analysis

51
Methods
• Archive of approximately 34,000 typhoon images for
the northern and southern hemispheres
• Various data mining approaches
– Principal component analysis (PCA), K-means clustering,
self-organizing map (SOM), wavelet transform
• Retrieval of historical similar patterns from image
databases to perform instance-based typhoon
analysis and prediction
• Extracting the eigenvectors of the whole typhoon
image collection

52
53
Case Studies (5): LEAD
• Linked Environments for Atmospheric Discovery
– To accommodate the real time, on-demand, and
dynamically-adaptive nature of mesoscale problems
• Complexities: vastly disparate, high volume and bandwidth
data
• Tremendous computational demands
– Used in accessing, preparing, assimilating, predicting,
managing, mining/analyzing, and displaying a broad
array of meteorological and related information
• Data Mining Solution Center: ITSC, The Univ. of
Alabama in Huntsville, US
– http://datamining.itsc.uah.edu/index.jsp

54
ADaM
• Algorithm Development and Mining (ADaM)
– Component architecture data mining toolkit
– For geophysical phenomena detection and
feature extraction
• Applications
– Detecting tropical cyclones and estimating their
maximum sustained wind speed
– Mesocyclone Identification from RADAR
– Detecting Cumulus Cloud Fields in GOES
Images

55
ADaM (cont’d)
– Mesoscale Convective
Systems Detection
• EOS Special Sensor
Microwave/Imager (SSM/I)
Brightness Temperature
Swaths from DMSP F13 and
F14
– Rain Detection Using SSM/I
– Lightning Detection Using
OLS
– Rain Accumulation Study
56
Case Studies (6): Rainfall Classification
University of Oklahoma, Norman
• To classify significant and interesting features within a
two-dimensional spatial field of meteorological data
– Observed or predicted rainfall
• Data source
– Estimates of hourly accumulated rainfall
– Using radar and raingage data
• “Attributes” for classification
– Statistical parameters representing the distribution of rainfall
amounts across the region
• Classification Method
– Hierarchical cluster analysis
57
Many Others…
• JARtool Project (Fayyad et al., NASA)
• Identifying volcanoes on the surface of
Venus from images transmitted by the
Magellan spacecraft
• More than 30,000 high resolution Synthetic
Aperture Radar (SAR) images of the surface
of Venus from different angles
• The obtained accuracy was about 80%

58
What can we learn from those scenarios?
• Data Mining is a promising way for
meteorological analysis
• Very strong interaction between scientists and
the knowledge discovery system is necessary
• The users define features of the meteorological
phenomena based on their expert knowledge
• The system extracts the instances of such
phenomena
• Then, further analysis of phenomena is
possible
59
Conclusions
• Data mining: discovering interesting patterns from
large amounts of data
• A natural evolution of database technology, in great
demand, with wide applications
• A KDD process includes data mining, and other steps
• Data Mining can be performed in a variety of
information repositories
• Data mining Tasks: characterization, discrimination,
association, classification, clustering, outlier and trend
analysis, etc.
61
And now
discussion

62
