Finalestkddfinalpresentation 111207021040 Phpapp01

KDD: A Definition
• KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
• Data of 10^6-10^12 bytes: we never see the whole data set, so we put it in the memory of computers, then run data mining algorithms.
• What is the knowledge? How to represent and use it?
Why do we need KDD?
Some data overload examples:
• Wal-Mart records 20 million transactions per day
• Health care transactions: multi-gigabyte databases
• Mobil Oil: geological data of over 100 terabytes
[Figure: "Data Overload" at the center, surrounded by Science, Retail, Marketing, Healthcare and Finance]
Data is the most important tool to gain a competitive edge by providing improved, customized services.
Knowledge Discovery Process
[Figure: the knowledge discovery process. Stage labels: Raw Data, Integration, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]
Knowledge Discovery in Databases
• Knowledge discovery in databases (KDD) is the non-trivial process of identifying valid, potentially useful and ultimately understandable patterns in data.
[Figure: KDD pipeline. Labels: Operational Databases; Clean, Collect, Summarize; Data Warehouse; Data Preparation; Training Data; Data Mining; Model, Patterns; Verification, Evaluation]
Goals
Data Selection, Acquisition & Integration
Data Cleaning
Data Reduction & Projection
Matching the Goals
Exploratory Data Analysis
Data Mining
Interpretation and Testing
Consolidation & Use

STEP – 1: IDENTIFYING THE GOAL
• The first step is developing an understanding of the application domain and the relevant prior knowledge, and identifying the goal of the KDD process from the customer’s viewpoint.
STEP – 2: CREATING A TARGET DATA SET
• Selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
STEP – 3: DATA CLEANING AND PREPROCESSING
• Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.
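One of the strategies for handling missing data fields named in Step 3 can be sketched as follows. This is only an illustration, assuming records are plain Python dicts and that replacing a missing value with the mean of the known values suits the goal. The field names and the helper `fill_missing_with_mean` are hypothetical and do not come from the slides.

```python
from statistics import mean

def fill_missing_with_mean(records, field):
    """Return copies of the records with None in `field` replaced by the mean of known values."""
    known = [r[field] for r in records if r[field] is not None]
    if not known:
        # Nothing to impute from: return unchanged copies.
        return [dict(r) for r in records]
    fill = mean(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

# Hypothetical customer records; the second one has a missing age.
records = [
    {"cust_id": 1, "age": 30},
    {"cust_id": 2, "age": None},
    {"cust_id": 3, "age": 50},
]
cleaned = fill_missing_with_mean(records, "age")
```

Mean imputation is only one option; dropping incomplete records or modeling the missing values explicitly may fit other goals better.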
STEP – 4: DATA REDUCTION AND PROJECTION
• Finding useful features to represent the data depending on the goal of the task.
• With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.
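As a small illustration of reducing the effective number of variables, the sketch below drops features that never vary. It is one of the simplest reduction methods, not a transformation method such as principal component analysis. The column names, the data and the helper `drop_low_variance` are assumptions made for the example.

```python
from statistics import pvariance

def drop_low_variance(rows, names, threshold=0.0):
    """Keep only the columns whose population variance exceeds `threshold`."""
    columns = list(zip(*rows))  # transpose rows into columns
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    kept_names = [names[i] for i in keep]
    reduced = [[row[i] for i in keep] for row in rows]
    return kept_names, reduced

# Hypothetical data: country_code is constant, so it carries no information.
names = ["age", "country_code", "spend"]
rows = [[25, 1, 120.0], [40, 1, 80.0], [33, 1, 200.0]]
kept_names, reduced = drop_low_variance(rows, names)
```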
STEP – 5: MATCHING THE GOALS
• Matching the goals of the KDD process to a particular data-mining method such as summarization, classification, regression, clustering, etc.
STEP – 6: EXPLORATORY ANALYSIS AND MODEL & HYPOTHESIS SELECTION
• Choosing the data mining algorithms and selecting methods to be used for searching for data patterns.
• This process includes deciding which models and parameters might be appropriate and matching a particular data-mining method with the overall criteria of the KDD process.
STEP – 7: DATA MINING
• Searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering.
• The user can significantly aid the data-mining method by correctly performing the preceding steps.
STEP – 8: INTERPRETATION & TESTING
• Interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration.
• This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.
STEP – 9: KNOWLEDGE PRESENTATION
• Using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties.
• This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.
Data Warehousing
• A platform for online analytical processing (OLAP)

• Warehouses collect transactional data from several
transactional databases and organize them in a fashion
amenable to analysis
• Smaller, subject-specific warehouses are called “data marts”
• A critical component of the decision support system (DSS) of
enterprises
• Some typical DW queries:
– Which item sells best in each region that has retail outlets?
– Which advertising strategy is best for Dubai Markets?
Data Warehousing
[Figure: OLTP data (e.g. Inventory) passes through Data Cleaning into the Data Warehouse (OLAP)]
Data Cleaning
• Performs logical transformation of transactional data to suit the data
warehouse
• Model of operations → model of enterprise
• Usually a semi-automatic process
Data Warehouse
[Figure: operational and warehouse schemas. Table names: Orders, Customers, Products, Inventory, Sales, Time. Field names: Order_id, Cust_id, Prod_id, Price, Cust_profit, Price_change, Total_sales]
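The transformation from transactional rows to analysis-ready totals can be sketched as below. The grouping of fields into tables is not recoverable from the slide, so this assumes, purely for illustration, that each order row carries Order_id, Cust_id, Prod_id and Price. The helper `total_sales_by` and the sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical transactional (OLTP-style) order rows.
orders = [
    {"Order_id": 1, "Cust_id": "C1", "Prod_id": "P1", "Price": 10.0},
    {"Order_id": 2, "Cust_id": "C2", "Prod_id": "P1", "Price": 10.0},
    {"Order_id": 3, "Cust_id": "C1", "Prod_id": "P2", "Price": 25.0},
]

def total_sales_by(orders, key):
    """Sum Price over all orders, grouped by the given field."""
    totals = defaultdict(float)
    for order in orders:
        totals[order[key]] += order["Price"]
    return dict(totals)

# Warehouse-style summaries derived from the same transactions.
sales_by_product = total_sales_by(orders, "Prod_id")
sales_by_customer = total_sales_by(orders, "Cust_id")
```

A query such as "which item sells best" then becomes a lookup over `sales_by_product` instead of a scan of the raw orders.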
Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Regression: mapping a data item to a real-valued prediction variable.
• Dependency modeling: finding a model which describes significant dependencies between variables.
• Deviation and change detection: discovering the most significant changes in the data.
• Summarization: finding a compact description for a subset of data.
Data Mining Algorithm Components
• Model representation
– descriptions of discovered patterns
– overly limited representation -- unable to capture data patterns
– too powerful -- potential for overfitting
– examples: decision trees, rules, linear/non-linear regression & classification, nearest neighbor and case-based reasoning methods, graphical dependency models
• Model evaluation criteria
– how well a pattern (model) meets goals (fit function)
– e.g., accuracy, novelty, etc.
• Search method
– parameter search: optimization of parameters for a given model representation
– model search: considers a family of models
Different methods suit different problems. Proper problem formulation is crucial.
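The three components can be seen side by side in a small sketch. It assumes a family of two model representations (a constant and a straight line), mean squared error on held-out data as the evaluation criterion, closed-form least squares as the parameter search, and a loop over the family as the model search. All function names and data here are illustrative assumptions.

```python
from statistics import mean

def fit_constant(xs, ys):
    """Parameter search for the constant model: the best constant is the mean of ys."""
    c = mean(ys)
    return lambda x: c

def fit_line(xs, ys):
    """Parameter search for the linear model: closed-form least squares."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mse(model, xs, ys):
    """Evaluation criterion: mean squared error of the model on (xs, ys)."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

def select_model(fitters, train, valid):
    """Model search: fit every representation, keep the one with lowest held-out error."""
    scores = {name: mse(fit(*train), *valid) for name, fit in fitters.items()}
    return min(scores, key=scores.get), scores

# Hypothetical data that roughly follows y = 2x + 1.
train = ([0, 1, 2, 3], [1.0, 3.1, 4.9, 7.0])
valid = ([4, 5], [9.0, 11.1])
best, scores = select_model({"constant": fit_constant, "line": fit_line}, train, valid)
```

Scoring on held-out data rather than on the training data is one common way to keep an overly powerful representation from winning by fitting noise.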
Data Mining Techniques
• Descriptive: Clustering, Association, Sequential Analysis
• Predictive: Classification, Decision Tree, Rule Induction, Neural Networks, Nearest Neighbor Classification, Regression
Association Rule: Application
• Supermarket Shelf Management

• Goal: to identify items which are bought together (by sufficiently many
customers)
• Approach: process point-of-sale data (collected with barcode scanners)
to find dependencies among items.
• Consider discovered rule:
{Diapers, Milk … } --> {Baby food}
• Example:
– If a customer buys Diapers and Milk, then they are very likely to buy Baby food.
– So stack baby food next to diapers?
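How strongly a rule such as {Diapers, Milk} --> {Baby food} holds is usually measured by its support and confidence. The sketch below computes both on a handful of made-up transactions. The data and the helper names `support` and `confidence` are assumptions for illustration, not figures from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0

# Hypothetical point-of-sale baskets.
transactions = [
    {"Diapers", "Milk", "Baby food"},
    {"Diapers", "Milk", "Baby food", "Bread"},
    {"Diapers", "Milk"},
    {"Milk", "Bread"},
]
rule_support = support(transactions, {"Diapers", "Milk", "Baby food"})
rule_confidence = confidence(transactions, {"Diapers", "Milk"}, {"Baby food"})
```

"Bought together by sufficiently many customers" corresponds to a minimum support threshold, and "very likely to buy" corresponds to a minimum confidence threshold.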
Sequential Pattern Discovery: Application
• Sequences in which customers purchase goods/services
• Understanding long term customer behavior -- timely promotions.
• In point-of-sale transaction sequences
– Computer bookstore:
(Intro to Visual C++) (Java & J2EE) --> (Perl for Dummies, PHP in 24 Hrs)
– Athletic Apparel Store:
(Shoes) (Racket, Racket ball) --> (Sports Jacket)
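A sequential pattern is supported by a customer when its itemsets appear in that customer's purchase history in the same order, though not necessarily on consecutive visits. The sketch below counts that support for the athletic apparel pattern. The purchase histories, the extra items "Socks" and "Towel", and both helper names are made up for illustration.

```python
def contains_sequence(history, pattern):
    """True if the itemsets in `pattern` occur in `history` in order (gaps allowed)."""
    pos = 0
    for basket in history:
        # Advance to the next pattern element once the current one is covered.
        if pos < len(pattern) and pattern[pos] <= basket:
            pos += 1
    return pos == len(pattern)

def sequence_support(histories, pattern):
    """Fraction of customer histories that contain the pattern."""
    return sum(contains_sequence(h, pattern) for h in histories) / len(histories)

# Hypothetical purchase histories, one list of baskets per customer.
histories = [
    [{"Shoes"}, {"Racket", "Racket ball"}, {"Sports Jacket"}],
    [{"Shoes"}, {"Sports Jacket"}],
    [{"Racket", "Racket ball"}, {"Shoes"}, {"Sports Jacket"}],
    [{"Shoes", "Socks"}, {"Racket", "Racket ball", "Towel"}, {"Sports Jacket"}],
]
pattern = [{"Shoes"}, {"Racket", "Racket ball"}, {"Sports Jacket"}]
pattern_support = sequence_support(histories, pattern)
```

The third customer bought the same goods but in a different order, so that history does not count toward the pattern.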
Clustering (Hierarchical and K-Means): Application
Hierarchical clustering: clusters are formed at different levels by merging clusters at a lower level.
K-means, by contrast, partitions the objects into K clusters around cluster means.
[Figure: K-means with K=2 on objects plotted on axes from 0 to 10. Panel captions: arbitrarily choose K objects as initial cluster centers; assign each of the objects to the most similar center; update the cluster means; reassign; update the cluster means]
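The loop shown in the figure (choose K initial centers, assign each object to the most similar center, update the cluster means, repeat) can be sketched in a few lines. This is a minimal version, assuming Euclidean distance as the similarity measure and taking the first K objects as the arbitrary initial centers. The sample points are invented.

```python
from math import dist
from statistics import mean

def k_means(points, k, max_iter=100):
    """Plain K-means: returns the final cluster centers and the clusters."""
    centers = list(points[:k])  # arbitrary choice: the first k objects
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(mean(coord) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # no change: converged
            break
        centers = new_centers
    return centers, clusters

# Hypothetical objects forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, k=2)
```

The result depends on the initial centers, so in practice the procedure is often run from several starting choices.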
Decision Tree Identification: Application
Decision Tree Identification Example

Outlook    Temp      Play?
Sunny      Warm      Yes
Overcast   Chilly    No
Sunny      Chilly    Yes
Cloudy     Pleasant  Yes
Overcast   Pleasant  Yes
Overcast   Chilly    No
Cloudy     Chilly    No
Cloudy     Warm      Yes

Resulting tree (first split on Outlook, then on Temp):
Outlook = Sunny: Yes
Outlook = Cloudy: Yes/No
    Temp = Warm: Yes
    Temp = Pleasant: Yes
    Temp = Chilly: No
Outlook = Overcast: Yes/No
    Temp = Pleasant: Yes
    Temp = Chilly: No
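The tree for the Outlook/Temp/Play? example can be grown mechanically by splitting until every branch holds a single class. The sketch below splits in the fixed order used on the slide (Outlook, then Temp). It does not choose attributes by information gain, as a full tree-induction algorithm would. The function names `build_tree` and `predict` are assumptions.

```python
from collections import Counter

ATTRS = ["Outlook", "Temp"]
ROWS = [
    ("Sunny", "Warm", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"),
    ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"),
    ("Cloudy", "Warm", "Yes"),
]

def build_tree(rows, attr_index=0):
    """Return a class label for a pure node, else (attribute, {value: subtree})."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:
        return labels[0]  # pure node: every row has the same class
    if attr_index >= len(ATTRS):
        return Counter(labels).most_common(1)[0][0]  # no attributes left: majority class
    branches = {}
    for row in rows:
        branches.setdefault(row[attr_index], []).append(row)
    subtrees = {value: build_tree(subset, attr_index + 1) for value, subset in branches.items()}
    return (ATTRS[attr_index], subtrees)

def predict(tree, record):
    """Follow the branches matching `record` (a dict) until a class label is reached."""
    while not isinstance(tree, str):
        attr, branches = tree
        tree = branches[record[attr]]
    return tree

tree = build_tree(ROWS)
answer = predict(tree, {"Outlook": "Overcast", "Temp": "Chilly"})
```

The Sunny branch is pure after the first split, which is why it becomes a leaf without any test on Temp.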
Major Application Areas for Data Mining (Classification)
• Advertising
• Bioinformatics
• Customer Relationship Management (CRM)
• Database Marketing
• Fraud Detection
• E-commerce
• Health Care
• Investment/Securities
• Manufacturing, Process Control
• Sports and Entertainment
• Telecommunications
• Web
Mining: Marketing
• Direct Marketing:
Most major direct marketing companies are using
modeling and data mining.
• Customer segmentation:
All industries can take advantage of DM to discover
discrete segments in their customer bases by considering
additional variables beyond traditional analysis.
• CRM:
Find other people in similar life stages and determine
which customers are following similar behavior patterns
– Up-sell
– Cross-sell
– Keeping the customers for a longer period of time
For example, Verizon Wireless reduced its churn rate from 2% to 1.5%.
Mining: Fraud Detection
• Credit Card Fraud Detection

• Money laundering
– FAIS (US Treasury)
• Securities Fraud
– NASDAQ Sonar system
• Phone fraud
– AT&T, Bell Atlantic, British Telecom/MCI
• Bio-terrorism detection at Salt Lake Olympics 2002
Mining: Retail
• Sales forecasting:
Examining time-based patterns helps retailers make
stocking decisions.
• Database Retailing:
Retailers can develop profiles of customers with
certain behaviors, for example, those who purchase
designer-label clothing or those who attend sales.
• Merchandise planning and allocation:

When retailers add new stores, they can improve
merchandise planning and allocation by examining
patterns in stores with similar demographic
characteristics.
Mining: Banking
• Credit Card marketing

By identifying customer segments, card
issuers and acquirers can improve
profitability with more effective acquisition
and retention programs.
• Cardholder pricing and profitability

Card issuers can take advantage of data
mining technology to price their products so
as to maximize profit and minimize loss of
customers.
Mining: Telecommunication
• Call detail record analysis:
Telecommunication companies accumulate
detailed call records. By identifying customer
segments with similar use patterns, the
companies can develop attractive pricing and
feature promotions.
• Customer loyalty:
Some customers repeatedly switch providers, or
“churn”, to take advantage of attractive incentives
by competing companies. The companies can use
DM to identify the characteristics of customers
who are likely to remain loyal once they switch,
thus enabling the companies to target their
spending on customers who will produce the most
profit.
Mining: Manufacturing
• Manufacturing:
Through choice boards, manufacturers are
beginning to customize products for
customers; therefore they must be able to
predict which features should be bundled to
meet customer demand.
• Warranties:
Manufacturers need to predict the number of
customers who will submit warranty claims
and the average cost of those claims.
Issues and Challenges
• Large data
– Number of variables (features), number of cases (examples)
– Multi gigabyte, terabyte databases
– Efficient algorithms, parallel processing
• High dimensionality
– Large number of features: exponential increase in search space
– Potential for spurious patterns
– Dimensionality reduction
• Overfitting
– Models noise in training data, rather than just the general patterns
• Changing data, missing and noisy data
• Use of domain knowledge
– Utilizing knowledge on complex data relationships, known facts
• Understandability of patterns
Success Stories
• Network intrusion detection using a combination of sequential
rule discovery and classification tree on 4 GB DARPA data
– Won over (manual) knowledge engineering approach
– http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides
good detailed description of the entire process
• Major US bank: customer attrition prediction
– First segment customers based on financial behavior: found 3
segments
– Build attrition models for each of the 3 segments
– 40-50% of attritions were predicted (a factor of 18 increase)
• Targeted credit marketing: major US banks
– Find customer segments based on 13 months credit balances
– Build another response model based on surveys
– Increased response 4 times -- 2%
Amitava Manna (11DCP007)
Amritanshu Mehra (11DCP008)
Animesh Ranjan (11DCP009)
