10^6-10^12 bytes: we never see the whole data set, so we put it in the memory of computers.
What is the knowledge? How to represent and use it?
Why do we need KDD?
Some Data Overload Examples:
• Science
• Retail Marketing
• Health care transactions: multi-gigabyte databases
[Figure: the KDD process. Labels: Raw Data, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]
Knowledge Discovery in Database
[Figure labels: Operational Databases; Verification, Evaluation; Model, Patterns]
Knowledge Discovery Process
• Goals
• Data Selection, Acquisition & Integration
• Data Cleaning
• Data Reduction and Projection
• Matching the goals
• Exploratory Data Analysis
• Data Mining
• Interpretation and Testing
• Consolidation & Use
STEP – 4: DATA REDUCTION AND PROJECTION
• Finding useful features to represent the data depending on the goal of the task.
• With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.
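As a minimal sketch of Step 4, the snippet below reduces the number of variables by dropping features that barely vary. This is only one simple reduction method, chosen for illustration; the slides do not prescribe it. The function name `drop_low_variance`, the threshold, and the sample feature names are all hypothetical.

```python
from statistics import pvariance

def drop_low_variance(rows, names, threshold=1e-3):
    """Keep only the columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))  # transpose: one tuple per feature
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return reduced, [names[i] for i in keep]

# Hypothetical data: the first feature is constant and carries no information.
rows = [
    [1.0, 5.0, 0.2],
    [1.0, 7.0, 0.9],
    [1.0, 6.0, 0.4],
]
reduced, kept = drop_low_variance(rows, ["const", "price", "score"])
print(kept)
```

Projection methods such as principal component analysis pursue the same goal by building new variables instead of discarding existing ones.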
Knowledge Discovery Process
STEP – 6: EXPLORATORY ANALYSIS AND MODEL & HYPOTHESIS SELECTION
• Choosing the data mining algorithms and selecting methods to be used for searching for data patterns.
• This process includes deciding which models and parameters might be appropriate and matching a particular data-mining method with the overall criteria of the KDD process.
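Deciding which model might be appropriate is often done by comparing candidates on data held out from fitting. The sketch below assumes two hypothetical candidates (a constant-mean predictor and a least-squares line) and hypothetical sample data; it illustrates the selection step only, not a method named in the slides.

```python
def fit_mean(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Candidate 2: ordinary least-squares line through the training data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data, split into a training part and a validation part.
train_x, train_y = [0, 1, 2, 3, 4, 5], [0.1, 2.1, 3.9, 6.2, 7.8, 10.1]
valid_x, valid_y = [6, 7], [12.0, 13.9]

candidates = {"mean": fit_mean, "line": fit_line}
scores = {name: mse(fit(train_x, train_y), valid_x, valid_y)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

Scoring on held-out data rather than on the training data keeps the comparison from rewarding a model that merely memorizes its training set.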
Knowledge Discovery Process
[Figure labels: OLTP, Inventory, Data Cleaning, Data Warehouse (OLAP)]
Data Cleaning
• Performs logical transformation of transactional data to suit the data
warehouse
• Model of operations → model of enterprise
• Usually a semi-automatic process
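A minimal sketch of such a transformation, assuming hypothetical order rows: transactional records are cleaned (missing prices skipped, customer keys normalized) and then aggregated into a per-customer total suited to a warehouse. The field names `Order_id`, `Cust_id`, and `Price` follow the Orders table in the slides; the data values and the helper names `clean` and `total_sales` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical transactional (OLTP) rows.
orders = [
    {"Order_id": 1, "Cust_id": "c1", "Price": "10.50"},
    {"Order_id": 2, "Cust_id": "c2", "Price": "4.00"},
    {"Order_id": 3, "Cust_id": "c1", "Price": None},      # missing value
    {"Order_id": 4, "Cust_id": " C1 ", "Price": "2.25"},  # inconsistent key format
]

def clean(rows):
    """Skip rows without a price, normalize customer keys, parse prices."""
    for row in rows:
        if row["Price"] is None:
            continue
        yield {
            "Order_id": row["Order_id"],
            "Cust_id": row["Cust_id"].strip().lower(),
            "Price": float(row["Price"]),
        }

def total_sales(rows):
    """Aggregate cleaned order rows into total sales per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["Cust_id"]] += row["Price"]
    return dict(totals)

summary = total_sales(clean(orders))
print(summary)
```

In practice the rules for what counts as a bad row usually need human review, which is one reason the process is only semi-automatic.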
Data Warehouse
[Figure: operational tables feeding the data warehouse]
Orders: Order_id, Price, Cust_id
Inventory: Prod_id, Price, Price_change
Sales: Cust_id, Cust_profit, Total_sales
Data warehouse: Customers, Products, Orders, Inventory, Price, Time
Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Regression: maps a data item to a real-valued prediction variable.
• Dependency Modeling: finding a model which describes significant dependencies between variables.
• Deviation and change detection: discovering the most significant changes in the data.
• Summarization: finding a compact description for a subset of data.
Data Mining Algorithm Components
• Model representation
– descriptions of discovered patterns
– overly limited representation -- unable to capture data patterns
– too powerful a representation -- potential for overfitting
(decision trees, rules, linear/non-linear regression & classification,
nearest neighbor and case-based reasoning methods, graphical
dependency models)
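To make one of the listed representations concrete, here is a minimal sketch of a nearest-neighbor classifier, where the "model" is simply the stored training data. The training points, labels, and function name are hypothetical.

```python
import math

def nearest_neighbor(train, query):
    """Return the label of the training point closest to query (Euclidean)."""
    _point, label = min(train, key=lambda pair: math.dist(pair[0], query))
    return label

# Hypothetical labelled examples: (features, class label).
train = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]

print(nearest_neighbor(train, (2.0, 1.0)))
print(nearest_neighbor(train, (7.0, 9.0)))
```

This representation is very flexible, which is exactly why it can fit noise: a single mislabelled training point changes the prediction for every query near it.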
Descriptive: Clustering
Predictive: Classification, Neural Networks, Regression
Association Rule: Application
[Figure: K-means clustering example with K=2, plotted on axes from 0 to 10. Steps shown: arbitrarily choose K objects as initial cluster means; assign each of the objects to the most similar center; update the cluster means; reassign; update the cluster means.]
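The K-means steps named in the figure (choose K initial means, assign each object to the most similar center, update the cluster means, reassign) can be sketched as below. The sample points are hypothetical, and taking the first K points as initial means is an assumption made to keep the sketch deterministic; the slide only says the choice is arbitrary.

```python
import math

def kmeans(points, k, max_iter=100):
    # Step 1: choose K objects as the initial cluster means
    # (here the first k points, an assumption for determinism).
    means = [points[i] for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each object to the most similar (nearest) center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the means are stable
        assignment = new_assignment
        # Step 3: update each cluster mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old mean if a cluster is empty
                means[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return means, assignment

# Hypothetical 2-D objects forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
means, assignment = kmeans(points, k=2)
print(assignment)
print(means)
```

The loop stops when a reassignment pass changes nothing, which is the usual convergence test; the result can still depend on the initial means.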
Decision Tree Identification: Application
[Figure: decision tree with a Yes/No outcome. Branch labels: Cloudy, Overcast, Sunny; Pleasant, Chilly, Warm. Leaves: Yes, No.]
Major Application Areas for Data Mining (Classification)
• Advertising
• Bioinformatics
• Customer Relationship Management (CRM)
• Database Marketing
• Fraud Detection
• E-commerce
• Health Care
• Investment/Securities
• Manufacturing, Process Control
• Sports and Entertainment
• Telecommunications
• Web
Major Application Areas for Data Mining: Marketing
• Direct Marketing:
Most major direct marketing companies are using
modeling and data mining.
• Customer segmentation:
All industries can take advantage of DM to discover
discrete segments in their customer bases by considering
additional variables beyond traditional analysis.
• CRM:
Find other people in similar life stages and determine
which customers are following similar behavior patterns
– Up-sell
– Cross-sell
– Keeping the customers for a longer period of time
For example, Verizon Wireless reduced its churn rate from 2% to 1.5%.
Major Application Areas for Data Mining: Fraud Detection
• Database Retailing:
Retailers can develop profiles of customers with
certain behaviors, for example, those who purchase
designer-label clothing or those who attend sales.
• Customer loyalty:
Some customers repeatedly switch providers, or
“churn”, to take advantage of attractive incentives
by competing companies. The companies can use
DM to identify the characteristics of customers
who are likely to remain loyal once they switch,
thus enabling the companies to target their
spending on customers who will produce the most
profit.
Major Application Areas for Data Mining: Manufacturing
• Manufacturing:
Through choice boards, manufacturers are
beginning to customize products for
customers; therefore they must be able to
predict which features should be bundled to
meet customer demand.
• Warranties:
Manufacturers need to predict the number of
customers who will submit warranty claims
and the average cost of those claims.
Issues and Challenges
• Large data
– Number of variables (features), number of cases (examples)
– Multi-gigabyte, terabyte databases
– Efficient algorithms, parallel processing
• High dimensionality
– Large number of features: exponential increase in search space
– Potential for spurious patterns
– Dimensionality reduction
• Overfitting
– Models noise in training data, rather than just the general patterns
• Changing data, missing and noisy data
• Use of domain knowledge
– Utilizing knowledge on complex data relationships, known facts
• Understandability of patterns
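One way to read "exponential increase in search space" is to count the candidate feature subsets a search could examine: with d features there are 2^d of them. The short sketch below checks that count by enumeration for a tiny case and then prints it for larger d. The feature names and function name are hypothetical, and subset search is only one example of a search space that grows this way.

```python
from itertools import combinations

def count_feature_subsets(d):
    """Number of distinct subsets of d features, including the empty set."""
    return 2 ** d

# Verify the formula by brute-force enumeration on a tiny feature set.
features = ["a", "b", "c"]
subsets = [s for r in range(len(features) + 1) for s in combinations(features, r)]
print(len(subsets), count_feature_subsets(len(features)))

# The count quickly becomes too large to enumerate.
for d in (10, 20, 30):
    print(d, count_feature_subsets(d))
```

With so many candidates, some will fit the data by chance alone, which is the link between high dimensionality and spurious patterns.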
Success Stories