
BIT33603 DATA MINING
Chapter 1
INTRODUCTION
Dr. Aida Mustapha
aidam@uthm.edu.my

What is Pattern Recognition
Pattern Recognition by Humans
- perceptual
- specialized decision making
Pattern Recognition by Computers
- benefit of automated pattern recognition
- advantage in complex calculations
Pattern Recognition from Data (Data Mining)

Pattern Recognition from Data
Pattern recognition from data is the process of learning from historical data: finding dependencies in the data and extracting knowledge from it.

What is Data

No.   Studies    Education   Works   Income (D)
1     Poor       SPM         Poor    None
2     Poor       SPM         Good    Low
3     Moderate   SPM         Poor    Low
4     Moderate   Diploma     Poor    Low
5     Poor       SPM         Poor    None
6     Moderate   Diploma     Poor    Low
7     Good       MSc         Good    Medium
:
99    Poor       SPM         Good    Low
100   Moderate   Diploma     Poor    Low

What is Knowledge
studies(Poor) AND work(Poor) => income(None)
studies(Poor) AND work(Good) => income(Low)
education(Diploma) => income(Low)
education(MSc) => income(Medium) OR income(High)
studies(Mod) => income(Low)
studies(Good) => income(Medium) OR income(High)
education(SPM) AND work(Good) => income(Low)
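
One way to read the rules above is as checks that can be run against the data. The sketch below is only illustrative: it encodes the nine rows that survive in the example table (rows 1 to 7, 99 and 100), and the names ROWS, RULES and check_rule are made up for this example. It counts how many rows each rule covers and whether any row contradicts it.

```python
# Rows 1-7, 99 and 100 of the example table: (studies, education, works, income).
# "None" is the income label used on the slide, stored here as a string.
ROWS = [
    ("Poor", "SPM", "Poor", "None"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "SPM", "Poor", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Poor", "SPM", "Poor", "None"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Good", "MSc", "Good", "Medium"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
]

# Each rule: (text, condition on (studies, education, works), allowed incomes).
RULES = [
    ("studies(Poor) AND work(Poor) => income(None)",
     lambda s, e, w: s == "Poor" and w == "Poor", {"None"}),
    ("studies(Poor) AND work(Good) => income(Low)",
     lambda s, e, w: s == "Poor" and w == "Good", {"Low"}),
    ("education(Diploma) => income(Low)",
     lambda s, e, w: e == "Diploma", {"Low"}),
    ("education(MSc) => income(Medium) OR income(High)",
     lambda s, e, w: e == "MSc", {"Medium", "High"}),
    ("studies(Mod) => income(Low)",
     lambda s, e, w: s == "Moderate", {"Low"}),
    ("studies(Good) => income(Medium) OR income(High)",
     lambda s, e, w: s == "Good", {"Medium", "High"}),
    ("education(SPM) AND work(Good) => income(Low)",
     lambda s, e, w: e == "SPM" and w == "Good", {"Low"}),
]


def check_rule(condition, allowed, rows):
    """Return (support, violations) for one rule.

    support    = number of rows that satisfy the left-hand side
    violations = those rows whose income is not among the allowed values
    """
    matching = [r for r in rows if condition(r[0], r[1], r[2])]
    violations = [r for r in matching if r[3] not in allowed]
    return len(matching), violations


for text, condition, allowed in RULES:
    support, violations = check_rule(condition, allowed, ROWS)
    print(f"{text}: support={support}, violations={len(violations)}")
```

On these nine rows every rule holds without exception; on the full 100-row table (not reproduced on the slide) the counts could of course differ.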

Why is Data Mining Prevalent?
Lots of data is collected and stored in data warehouses
- Business: Wal-Mart logs nearly 20 million transactions per day
- Astronomy: telescopes collect large amounts of data
- Space: NASA is collecting petabytes of data from satellites
- Physics: high-energy physics experiments are expected to generate 100 to 1000 terabytes in the next decade

Why is Data Mining Prevalent?
Quality and richness of the data collected is improving
Retailers
- Scanner data is much more accurate than other means
E-commerce
- Rich data on customer browsing
Science
- Accuracy of sensors is improving

Why is Data Mining Prevalent?
The gap between data and analysts is increasing
- Existence of hidden information
- High cost of human labor
- Much of the data is never analyzed at all

Origins of Data Mining
Draws ideas from Machine Learning, Pattern Recognition, Statistics, and Database Systems for applications that have
- Enormous amounts of data
- High dimensionality of data
- Heterogeneous data
- Unstructured data

Data Mining: Multiple Disciplines
[Figure: DATA MINING at the center, surrounded by the contributing disciplines: high-performance computing, database technology, visualization, pattern recognition, statistics, machine learning, spatial data analysis, information retrieval, information science, neural networks]

Data Mining: What It Isn't
Small Scale
- Data mining methods are designed for large data sets
Foolproof
- Data mining techniques will discover patterns in any data
- The patterns discovered may be meaningless
- It is up to the user to determine how to interpret the results
Magic
- Data mining techniques cannot generate information that is not present in the data
- They can only find the patterns that are already there

Example: Data Mining Is Not
Generating multidimensional cubes of a relational table
Searching for a phone number in a phone book
Searching for keywords on Google (IR)
Generating a histogram of salaries for different age groups
Issuing an SQL query to a database and reading the reply

Data Mining: What It Is
Extracting knowledge from large amounts of data
Uses techniques from:
- Pattern Recognition
- Machine Learning
- Statistics
Plus techniques unique to data mining (association rules)
Data mining methods must be efficient and scalable

Example: Data Mining Is
What goods should be promoted to this customer?
What is the probability that a certain customer will respond to a planned promotion?
Can one predict the most profitable securities to buy/sell during the next trading session?
Will this customer default on a loan or pay it back on schedule?
What medical diagnosis should be assigned to this patient?
What kinds of cars should be sold this year?
Finding groups of people with similar hobbies
Are the chances of getting cancer higher if you live near a power line?

Data Mining is simply...


Finding relationships
Making predictions

Data Mining: Definition
The non-trivial extraction of implicit, previously unknown, and potentially useful information from data
(William J. Frawley, Gregory Piatetsky-Shapiro and Christopher J. Matheus)

Data Mining: One Step of KDD
[Figure: the KDD process. Databases and flat files -> Cleaning and Integration -> Data Warehouse -> Selection and Transformation -> Data Mining -> Patterns -> Evaluation & Presentation -> Knowledge]

Data Mining: One Step of KDD
Data cleaning
- To remove noise and inconsistent data
Data integration
- Multiple data sources may be combined
Data selection
- Data relevant to the analysis task are retrieved from the database
Data transformation
- Data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations

Data Mining: One Step of KDD
Data mining
- An essential process where intelligent methods are applied in order to extract data patterns
Pattern evaluation
- To identify the truly interesting patterns representing knowledge, based on some interestingness measures
Knowledge presentation
- Visualization and knowledge representation techniques are used to present the mined knowledge to the users
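
The KDD steps listed above can be pictured as a chain of small functions, each feeding the next. The sketch below is a toy illustration only: the two record sources, the field names and the confidence threshold are hypothetical, and the "mining" step is a deliberately simple rule counter, not any particular published algorithm.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [
    {"id": 1, "outlook": "sunny", "play": "no"},
    {"id": 2, "outlook": None, "play": "yes"},
]
source_b = [
    {"id": 3, "outlook": "overcast", "play": "yes"},
    {"id": 4, "outlook": "overcast", "play": "yes"},
    {"id": 5, "outlook": "sunny", "play": "no"},
]


def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, fields):
    """Data selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: aggregate into counts per (outlook, play) pair."""
    return Counter((r["outlook"], r["play"]) for r in records)


def mine(counts):
    """Data mining: candidate rules 'outlook => play' with their confidence."""
    totals = Counter()
    for (outlook, _play), n in counts.items():
        totals[outlook] += n
    return {(o, p): n / totals[o] for (o, p), n in counts.items()}


def evaluate(rules, min_confidence=1.0):
    """Pattern evaluation: keep rules that meet the interestingness threshold."""
    return {rule: c for rule, c in rules.items() if c >= min_confidence}


combined = integrate(clean(source_a), clean(source_b))
patterns = evaluate(mine(transform(select(combined, ["outlook", "play"]))))

# Knowledge presentation: print the surviving rules in readable form.
for (outlook, play), confidence in sorted(patterns.items()):
    print(f"outlook={outlook} => play={play} (confidence {confidence:.2f})")
```

Keeping each step as its own function mirrors the slide: any step can be replaced (for example a different cleaning policy) without touching the others.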

Early Steps of Data Mining
Data preprocessing
- handling incomplete data, noisy data, uncertain data
Data discretization/representation
- transforms data into suitable values for the mining algorithm to find patterns
Data selection
- selects the suitable data for mining purposes
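
As a small illustration of discretization, the sketch below maps numeric temperatures to three labels using equal-width intervals. Equal-width binning is only one possible scheme and is an assumption here, as are the labels "cool", "mild" and "hot"; the temperature values are the ones from the weather example used later in this chapter.

```python
def equal_width_bins(values, labels):
    """Map each numeric value to one of len(labels) equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    if width == 0:
        # All values are identical, so everything falls in the first bin.
        return [labels[0] for _ in values]
    result = []
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        index = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[index])
    return result


temperatures = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
discretized = equal_width_bins(temperatures, ["cool", "mild", "hot"])
print(list(zip(temperatures, discretized)))
```

With a range of 64 to 85 the three intervals are 7 units wide, so 64 to 70 become "cool", 71 to 77 "mild" and 78 to 85 "hot".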

Database Systems
Kinds of DB
- Relational
- Data warehouse
- Transactional DB
- Advanced DB system
- Flat files
- WWW
Kinds of Knowledge
- Classification
- Association
- Clustering
- Prediction

Data Mining: Types of Data
Mining can be performed on data in a variety of forms
Relational Database
- Traditional DBMS everyone is familiar with
- Data is stored in a series of tables (collection of tables)
- Data is extracted via queries, typically with SQL
- SQL: Show me a list of items that were sold in the last quarter
- Show me the total sales of the last month, grouped by branch
- How many transactions occurred in the month of December?
- Which salesperson had the highest amount of sales?
- Relational language: aggregate functions such as
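
The example questions above map directly onto SQL. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the sales table, its columns and all of its rows are hypothetical and exist only to make the queries runnable. SUM and COUNT are standard SQL aggregate functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "item TEXT, branch TEXT, salesperson TEXT, sold_on TEXT, amount REAL)"
)
# Hypothetical sample rows (dates stored as ISO strings).
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("laptop", "north", "Aminah", "2023-12-03", 3200.0),
        ("phone", "north", "Badrul", "2023-12-10", 1500.0),
        ("phone", "south", "Aminah", "2023-12-21", 1450.0),
        ("tablet", "south", "Chong", "2023-11-15", 900.0),
    ],
)

# Total sales of one month (December here), grouped by branch.
by_branch = conn.execute(
    "SELECT branch, SUM(amount) FROM sales "
    "WHERE sold_on BETWEEN '2023-12-01' AND '2023-12-31' "
    "GROUP BY branch ORDER BY branch"
).fetchall()

# How many transactions occurred in the month of December?
december_count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE strftime('%m', sold_on) = '12'"
).fetchone()[0]

# Which salesperson had the highest amount of sales?
top_seller = conn.execute(
    "SELECT salesperson, SUM(amount) AS total FROM sales "
    "GROUP BY salesperson ORDER BY total DESC LIMIT 1"
).fetchone()

print(by_branch, december_count, top_seller)
```

Each of these queries only retrieves or summarizes what is already stored; none of them predicts anything, which is the distinction the following slides draw between querying and mining.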

Data Mining: Types of Data
Applying data mining goes further
- Searching for trends or data patterns
- Analyze customer data to predict the credit risk of new customers based on their income
- Detect deviations: items whose sales are far from those expected in comparison with the previous year (to be investigated further: change in packaging, increase in price?)
Transaction Database
- Similar to a relational database (transactions stored in a table)
- Each row (record) is a transaction with an id and the list of items in the transaction
- Nested relation
- Can be unfolded into a relational database or stored in flat files, since nested relational structures are not supported by relational DB systems
- Which items sold well together?
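
A minimal sketch of the "which items sold well together?" question: count how often each pair of items appears in the same transaction and keep the pairs that reach a minimum count. The transaction ids, items and the threshold of 2 are hypothetical; this is plain pair counting, not a full association-rule algorithm. The same data is also unfolded into flat (id, item) rows, as the slide describes.

```python
from collections import Counter
from itertools import combinations

# Hypothetical nested transactions: id -> list of items.
transactions = {
    "T100": ["bread", "milk", "eggs"],
    "T200": ["bread", "milk"],
    "T300": ["milk", "eggs"],
    "T400": ["bread", "milk", "butter"],
}

# Unfold the nested relation into flat (transaction id, item) rows.
flat_rows = [(tid, item) for tid, items in transactions.items() for item in items]

# Count every unordered pair of distinct items bought together.
pair_counts = Counter()
for items in transactions.values():
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

min_support = 2  # keep pairs that occur in at least this many transactions
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}

for pair, n in sorted(frequent_pairs.items(), key=lambda kv: -kv[1]):
    print(pair, n)
```

Sorting each transaction's items before pairing means ("bread", "milk") and ("milk", "bread") are counted as the same pair.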

Data Mining: Types of Data
Data Warehouse
- Stores historical data, potentially from multiple sources
- Organized around major subjects
- Contains summary statistics
Object / Object-Relational Databases
- Database consisting of objects
- Object = set of variables + associated methods
- E.g. Intel uses regularity extraction in automatic circuit layout
Images
- Can mine features extracted from images, OR
- Can use mining techniques to extract features
- Content-based image retrieval

Data Mining: Types of Data
Vector Geometries (spatial DB)
- Include GIS and CAD data
- Raster data: n-dimensional bitmaps / pixel maps
- Vector format: point, line, polygon
- Can find spatial patterns between features
- Describe the characteristics of houses located near a specified kind of location
- Describe the climate of mountainous areas located at various altitudes
Text
- Can be unstructured, semi-structured, or structured
- Documentation, newspaper articles, web sites etc.
- Can facilitate search by linking related documents / concepts

Data Mining: Types of Data
Video / Audio
- Speech recognition: recognizing spoken commands
- Security applications
- Integrated with standard data mining methods (storage and searching)
Temporal Databases / Time Series
- Global change databases (temperature records)
- Space shuttle telemetry
- Stock market data (stock exchange)
- Usually store relational data that include time-related attributes
- Find trends in how objects change, to support decision making and strategy planning

Data Mining: Types of Data
Stock exchange data can be mined to uncover trends that could help in planning investment strategies (when is the best time to purchase TNB stock?)
Legacy Databases
- Group of heterogeneous databases (relational, OO DB, network DB, multimedia DB etc.)
- Connected by intra- or inter-computer networks
- Information exchange is very difficult, e.g. student academic performance among different schools/universities
- Data mining: transforming the given data into higher, more generalized, conceptual levels

Evolution of DB Technology
Data mining can be viewed as a result of the natural evolution of database technology (Fig. 1.1).
The figure shows 5 stages of functionality:
- data collection and database creation
- database management systems
- advanced database systems
- web-based database systems
- data warehousing and data mining

Evolution of DB Technology
Database systems provide data storage and retrieval, and transaction processing.
Data warehousing and data mining provide data analysis and understanding.
A data warehouse is a database architecture that stores many different types of databases: a repository of multiple heterogeneous data sources. These sources are organized under a unified schema at a single site in order to facilitate management decision making.

Evolution of DB Technology
Data warehouse technology includes:
- data cleansing
- data integration, and
- On-Line Analytical Processing (OLAP)
OLAP is an analysis technique for performing summarization, consolidation, and aggregation, as well as the ability to view information from different angles.
OLAP tools support data analysis, but not in-depth analysis such as data classification, clustering, and the characterization of data changes over time.

DBMS, OLAP and Data Mining
Task
- DBMS: Extraction of detailed and summary data
- OLAP: Summaries, trends and forecasts
- Data Mining: Knowledge discovery of hidden patterns and insight
Type of result
- DBMS: Information
- OLAP: Analysis
- Data Mining: Insight and prediction
Method
- DBMS: Deduction (ask the question, verify with data)
- OLAP: Multidimensional data modeling, aggregation, statistics
- Data Mining: Induction (build the model, apply it to new data, get the result)
Example question
- DBMS: Who purchased mutual funds in the last 3 years?
- OLAP: What is the average income of mutual fund buyers by region by year?
- Data Mining: Who will buy a mutual fund in the next 6 months and why?

Example: Weather data
Record of the weather conditions during a two-week period, along with the decisions of a tennis player whether or not to play tennis on each particular day
Generated tuples (or examples, instances) consisting of values of 4 independent variables
- Outlook
- Temperature
- Humidity
- Windy
One dependent variable: play

Day   outlook    temperature   humidity   windy   play
1     sunny      85            85         false   no
2     sunny      80            90         true    no
3     overcast   83            86         false   yes
4     rainy      70            96         false   yes
5     rainy      68            80         false   yes
6     rainy      65            70         true    no
7     overcast   64            65         true    yes
8     sunny      72            95         false   no
9     sunny      69            70         false   yes
10    rainy      75            80         false   yes
11    sunny      75            70         true    yes
12    overcast   72            90         true    yes
13    overcast   81            75         false   yes
14    rainy      71            91         true    no

DBMS
We may answer questions by querying a DBMS containing the above table
- What was the temperature on the sunny days?
- On which days was the humidity less than 75?
- On which days was the temperature greater than 70?
- On which days was the temperature greater than 70 and the humidity was
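
The first three questions above can be written as SQL queries over the weather table. The sketch below loads the 14 rows into an in-memory SQLite database through Python's sqlite3 module; the table name, column names and the 0/1 encoding of windy are choices made for this example.

```python
import sqlite3

# (day, outlook, temperature, humidity, windy as 0/1, play) from the weather table.
WEATHER = [
    (1, "sunny", 85, 85, 0, "no"), (2, "sunny", 80, 90, 1, "no"),
    (3, "overcast", 83, 86, 0, "yes"), (4, "rainy", 70, 96, 0, "yes"),
    (5, "rainy", 68, 80, 0, "yes"), (6, "rainy", 65, 70, 1, "no"),
    (7, "overcast", 64, 65, 1, "yes"), (8, "sunny", 72, 95, 0, "no"),
    (9, "sunny", 69, 70, 0, "yes"), (10, "rainy", 75, 80, 0, "yes"),
    (11, "sunny", 75, 70, 1, "yes"), (12, "overcast", 72, 90, 1, "yes"),
    (13, "overcast", 81, 75, 0, "yes"), (14, "rainy", 71, 91, 1, "no"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather (day INTEGER, outlook TEXT, temperature INTEGER, "
    "humidity INTEGER, windy INTEGER, play TEXT)"
)
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?)", WEATHER)


def column(sql):
    """Run a single-column query and return its values as a plain list."""
    return [value for (value,) in conn.execute(sql)]


# What was the temperature on the sunny days?
sunny_temps = column(
    "SELECT temperature FROM weather WHERE outlook = 'sunny' ORDER BY day")

# On which days was the humidity less than 75?
low_humidity_days = column(
    "SELECT day FROM weather WHERE humidity < 75 ORDER BY day")

# On which days was the temperature greater than 70?
warm_days = column(
    "SELECT day FROM weather WHERE temperature > 70 ORDER BY day")

print(sunny_temps, low_humidity_days, warm_days)
```

Every answer is a direct lookup: the query states the condition and the DBMS returns the rows that satisfy it.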

OLAP
Using OLAP (Online Analytical Processing), create a multidimensional model (data cube)
E.g. with the dimensions time, outlook and play we can create the model below (each cell gives the yes/no counts for play; 9/5 overall)

9/5      sunny   rainy   overcast
Week 1   0/2     2/1     2/0
Week 2   2/1     1/1     2/0
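
The cube above can be reproduced by counting records along the three dimensions. This is a minimal sketch using collections.Counter; the helper names week_of and cell are invented for the example, and days 1 to 7 are assumed to form week 1 and days 8 to 14 week 2.

```python
from collections import Counter

# (day, outlook, play) for the 14 days of the weather table.
RECORDS = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"), (14, "rainy", "no"),
]


def week_of(day):
    """Days 1-7 fall in week 1, days 8-14 in week 2."""
    return (day - 1) // 7 + 1


# The cube: a count for every (week, outlook, play) combination.
cube = Counter((week_of(day), outlook, play) for day, outlook, play in RECORDS)


def cell(week, outlook):
    """Format one cell as 'yes/no' counts, as on the slide."""
    return f"{cube[(week, outlook, 'yes')]}/{cube[(week, outlook, 'no')]}"


for week in (1, 2):
    print(f"Week {week}", *(cell(week, o) for o in ("sunny", "rainy", "overcast")))

total_yes = sum(n for (_w, _o, play), n in cube.items() if play == "yes")
total_no = sum(n for (_w, _o, play), n in cube.items() if play == "no")
print(f"Total {total_yes}/{total_no}")
```

A Counter returns 0 for combinations that never occur, so empty cells need no special handling.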

OLAP
By observing the data cube we can easily identify some important properties of the data
- Find regularities or patterns
- E.g. the 3rd column: if the outlook is overcast, the play attribute is always yes
- If outlook = overcast then play = yes

Drill-down: time dimension
Concept hierarchy (cells are yes/no counts for play; 9/5 overall)

Day   sunny   rainy   overcast
1     0/1     0/0     0/0
2     0/1     0/0     0/0
3     0/0     0/0     1/0
4     0/0     1/0     0/0
5     0/0     1/0     0/0
6     0/0     0/1     0/0
7     0/0     0/0     1/0
8     0/1     0/0     0/0
9     1/0     0/0     0/0
10    0/0     1/0     0/0
11    1/0     0/0     0/0
12    0/0     0/0     1/0
13    0/0     0/0     1/0
14    0/0     0/1     0/0

Roll-up (reverse of drill-down)

9/5      sunny   rainy   overcast
Week 1   0/2     2/1     2/0
Week 2   2/1     1/1     2/0
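
Roll-up can be sketched as summing the cells of a finer level into their parent in the concept hierarchy (day to week, week to all). The function name roll_up and the dictionary layout are assumptions made for this example; the day-level values are the non-zero cells of the drill-down table, with the 0/0 cells left out.

```python
from collections import defaultdict

# Day-level cells: (day, outlook) -> (yes count, no count). Cells that are
# 0/0 in the drill-down table are simply omitted.
DAY_LEVEL = {
    (1, "sunny"): (0, 1), (2, "sunny"): (0, 1), (3, "overcast"): (1, 0),
    (4, "rainy"): (1, 0), (5, "rainy"): (1, 0), (6, "rainy"): (0, 1),
    (7, "overcast"): (1, 0), (8, "sunny"): (0, 1), (9, "sunny"): (1, 0),
    (10, "rainy"): (1, 0), (11, "sunny"): (1, 0), (12, "overcast"): (1, 0),
    (13, "overcast"): (1, 0), (14, "rainy"): (0, 1),
}


def roll_up(cells, parent_of):
    """Aggregate (yes, no) counts from a finer time level to its parent level."""
    rolled = defaultdict(lambda: (0, 0))
    for (member, outlook), (yes, no) in cells.items():
        key = (parent_of(member), outlook)
        total_yes, total_no = rolled[key]
        rolled[key] = (total_yes + yes, total_no + no)
    return dict(rolled)


# Day -> week: days 1-7 are week 1, days 8-14 are week 2.
week_level = roll_up(DAY_LEVEL, lambda day: (day - 1) // 7 + 1)
# Week -> all: collapse the time dimension completely.
all_level = roll_up(week_level, lambda week: "all")

for week in (1, 2):
    row = ["{}/{}".format(*week_level[(week, o)])
           for o in ("sunny", "rainy", "overcast")]
    print(f"Week {week}", *row)
```

Drill-down is the reverse direction: it needs the finer-grained data to be available, because the week totals alone cannot be split back into days.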