
KDD: A Definition

• KDD is the automatic extraction of non-obvious, hidden knowledge from
large volumes of data.

• 10^6 to 10^12 bytes: we never see the whole data set, so we put it in
the memory of computers.
• Then run data mining algorithms.
• What is the knowledge? How to represent and use it?
Why do we need KDD?
Some Data Overload Examples:
• Wal-Mart records 20 million transactions per day
• Health care transactions: multi-gigabyte databases
• Mobil Oil: geological data of over 100 terabytes

[Figure: sources of data overload: Science, Retail, Marketing, Healthcare, Finance]

Data is the most important tool to gain a competitive edge by
providing improved, customized services.
Knowledge Discovery Process
[Figure: Raw Data -> Integration -> Data Warehouse -> Target Data ->
Transformed Data -> Patterns and Rules -> Interpretation & Evaluation ->
Understanding -> Knowledge]
Knowledge Discovery in Database

• Knowledge discovery in databases (KDD) is the non-trivial process of
identifying valid, potentially useful and ultimately understandable
patterns in data.

[Figure: Operational Databases -> Clean, Collect, Summarize -> Data
Warehouse -> Data Preparation -> Training Data -> Data Mining -> Model,
Patterns -> Verification, Evaluation]
Knowledge Discovery Process
Goals

Data Selection, Acquisition & Integration

Data Cleaning

Data Reduction & Projection

Matching the Goals

Exploratory Data Analysis

Data Mining

Interpretation and Testing

Consolidation & Use


Knowledge Discovery Process

STEP – 1: IDENTIFYING THE GOAL

• The first step is developing an understanding of the application
domain and the relevant prior knowledge, and identifying the goal of
the KDD process from the customer’s viewpoint.
Knowledge Discovery Process

STEP – 2: CREATING A TARGET DATA SET

• Selecting a data set, or focusing on a subset of variables or data
samples, on which discovery is to be performed.
Knowledge Discovery Process

STEP – 3: DATA CLEANING AND PREPROCESSING

• Basic operations include removing noise if appropriate, collecting the
necessary information to model or account for noise, deciding on
strategies for handling missing data fields, and accounting for
time-sequence information and known changes.
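One of the cleaning operations named in Step 3, handling missing data fields, can be sketched as follows. This is a minimal illustration only: mean imputation is just one possible strategy, and the function name `impute_missing` and the `patients` records are hypothetical, not taken from the slides.

```python
from statistics import mean

def impute_missing(records, field):
    """Return copies of the records with None in `field` replaced by the field mean."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = mean(observed)  # assumes at least one observed value exists
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in records]

# Hypothetical target data with one missing age value.
patients = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 3, "age": 50},
]
cleaned = impute_missing(patients, "age")
print(cleaned[1]["age"])
```

The original records are left untouched, so the raw target data stays available if a different strategy (dropping the row, using the median) turns out to suit the goal better.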
Knowledge Discovery Process

STEP – 4: DATA REDUCTION AND PROJECTION

• Finding useful features to represent the data depending on the goal of
the task.
• With dimensionality reduction or transformation methods, the effective
number of variables under consideration can be reduced, or invariant
representations for the data can be found.
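As a rough illustration of reducing the number of variables under consideration, the sketch below drops columns that never vary and so cannot help distinguish cases. A variance filter is only one simple choice among many reduction methods; the function name, column names, and data are assumptions made for the example.

```python
from statistics import pvariance

def drop_low_variance(rows, names, threshold=0.0):
    """Keep only the columns whose population variance exceeds `threshold`."""
    columns = list(zip(*rows))  # transpose rows into columns
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return reduced, [names[i] for i in keep]

# Hypothetical data: the middle column is constant and carries no information.
rows = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.1],
    [3.0, 5.0, 0.4],
]
reduced, kept = drop_low_variance(rows, ["height", "constant", "ratio"])
print(kept)
```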
Knowledge Discovery Process

STEP – 5: MATCHING THE GOALS

• Matching the goals of the KDD process to a particular data-mining
method such as summarization, classification, regression, clustering,
etc.
Knowledge Discovery Process

STEP – 6: EXPLORATORY ANALYSIS AND MODEL & HYPOTHESIS SELECTION

• Choosing the data mining algorithms and selecting methods to be used
for searching for data patterns.
• This process includes deciding which models and parameters might be
appropriate and matching a particular data-mining method with the
overall criteria of the KDD process.
Knowledge Discovery Process

STEP – 7: DATA MINING

• Searching for patterns of interest in a particular representational
form or a set of such representations, including classification rules
or trees, regression, and clustering.
• The user can significantly aid the data-mining method by correctly
performing the preceding steps.
Knowledge Discovery Process

STEP – 8: INTERPRETATION & TESTING

• Interpreting mined patterns, possibly returning to any of steps 1
through 7 for further iteration.
• This step can also involve visualization of the extracted patterns and
models or visualization of the data given the extracted models.
Knowledge Discovery Process

STEP – 9: KNOWLEDGE PRESENTATION

• Using the knowledge directly, incorporating the knowledge into another
system for further action, or simply documenting it and reporting it to
interested parties.
• This process also includes checking for and resolving potential
conflicts with previously believed (or extracted) knowledge.
Data Warehousing

• A platform for online analytical processing (OLAP)


• Warehouses collect transactional data from several
transactional databases and organize them in a fashion
amenable to analysis
• Also called “data marts”
• A critical component of the decision support system (DSS) of
enterprises
• Some typical DW queries:
– Which item sells best in each region that has retail outlets?
– Which advertising strategy is best for Dubai Markets?
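The first typical query above ("which item sells best in each region") amounts to a group-by aggregation. The sketch below shows the idea in plain Python rather than in a real OLAP engine; the `sales` tuples and the function name are invented for illustration.

```python
from collections import defaultdict

def best_seller_by_region(sales):
    """sales: iterable of (region, item, units). Returns {region: top-selling item}."""
    totals = defaultdict(lambda: defaultdict(int))
    for region, item, units in sales:
        totals[region][item] += units  # aggregate units per (region, item)
    return {region: max(items, key=items.get) for region, items in totals.items()}

# Hypothetical warehouse facts: (region, item, units sold).
sales = [
    ("North", "milk", 10), ("North", "bread", 7), ("North", "bread", 5),
    ("South", "milk", 4), ("South", "eggs", 9),
]
print(best_seller_by_region(sales))
```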
Data Warehousing

[Figure: OLTP sources (e.g. Inventory) -> Data Cleaning -> Data Warehouse (OLAP)]
Data Cleaning
• Performs logical transformation of transactional data to suit the data
warehouse
• Model of operations -> model of enterprise
• Usually a semi-automatic process
Data Warehouse
[Figure: data warehouse schema. Tables shown: Orders, Customers,
Products, Inventory, Sales, Time. Fields shown: Order_id, Cust_id,
Prod_id, Price, Inventory, Cust_profit, Price_change, Total_sales.]
Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes
and classifying a data item into one of them.
• Clustering: identifying a finite set of categories or clusters to
describe the data.
• Regression: maps a data item to a real-valued prediction variable.
• Dependency Modeling: finding a model which describes significant
dependencies between variables.
• Deviation and change detection: discovering the most significant
changes in the data.
• Summarization: finding a compact description for a subset of data.
Data Mining Algorithm Components
• Model representation
– descriptions of discovered patterns
– overly limited representation: unable to capture data patterns
– too powerful a representation: potential for overfitting
– examples: decision trees, rules, linear/non-linear regression &
classification, nearest neighbor and case-based reasoning methods,
graphical dependency models

• Model evaluation criteria


– how well a pattern (model) meets goals (fit function)
– e.g., accuracy, novelty, etc.
Data Mining Algorithm Components
• Search method
– parameter search: optimization of parameters for a given model
representation
– model search: considers a family of models

Different methods suit different problems. Proper problem formulation is
crucial.
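The three components (model representation, evaluation criterion, search method) can be seen together in a toy sketch. Here the model is a single threshold rule, the criterion is accuracy, and the parameter search simply tries every candidate threshold. All names and data are illustrative assumptions; real algorithms use far richer models and smarter search.

```python
def accuracy(threshold, data):
    """Evaluation criterion: fraction of (x, label) pairs the rule `x >= threshold` gets right."""
    return sum((x >= threshold) == label for x, label in data) / len(data)

def parameter_search(data, candidates):
    """Search method: return the candidate threshold with the highest accuracy."""
    return max(candidates, key=lambda t: accuracy(t, data))

# Hypothetical labeled data: (feature value, class label).
data = [(1, False), (2, False), (3, True), (4, True), (5, True)]
best = parameter_search(data, candidates=[1, 2, 3, 4, 5])
print(best, accuracy(best, data))
```

A model search would wrap this in an outer loop over several model families (for example, threshold rules versus small trees) and compare the best member of each.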
Data Mining Techniques

• Descriptive
– Clustering
– Association
– Sequential Analysis
• Predictive
– Classification: Decision Tree, Rule Induction, Neural Networks,
Nearest Neighbor Classification
– Regression
Association Rule: Application

• Supermarket Shelf Management


• Goal: to identify items which are bought together (by sufficiently many
customers)
• Approach: process point-of-sale data (collected with barcode scanners)
to find dependencies among items.
• Consider discovered rule:
{Diapers, Milk … } --> {Baby food}
• Example:
– If a customer buys Diapers and Milk, then he is very likely to buy
Baby foods.
– so stack baby foods next to diapers?
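Two standard measures, support and confidence, make "bought together by sufficiently many customers" and "very likely to buy" concrete. The slide does not name them, so treat the sketch below as one common way to score such a rule; the `baskets` data is invented.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of `consequent` given `antecedent`."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Hypothetical point-of-sale baskets.
baskets = [
    {"diapers", "milk", "baby food"},
    {"diapers", "milk", "baby food", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
]
rule_from, rule_to = {"diapers", "milk"}, {"baby food"}
print(support(rule_from | rule_to, baskets))
print(confidence(rule_from, rule_to, baskets))
```

In this toy data the rule holds in half of all baskets and in two of the three baskets that contain both diapers and milk.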
Sequential Pattern Discovery: Application

• Sequences in which customers purchase goods/services


• Understanding long term customer behavior -- timely
promotions.

• In point-of-sale transaction sequences


– Computer bookstore:
(Intro to Visual C++) (Java & J2EE) --> (Perl for Dummies, PHP in 24 Hrs)

– Athletic Apparel Store:


(Shoes) (Racket, Racket ball) --> (Sports Jacket)
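A sequential pattern is supported by a customer when its items appear in that customer's purchase history in the same order, not necessarily back to back. The sketch below counts that support. It simplifies each visit to a single item, and the histories are made up for illustration.

```python
def contains_in_order(history, pattern):
    """True if the items of `pattern` occur in `history` in the same order."""
    remaining = iter(history)
    # `item in remaining` advances the iterator, so later items must come later.
    return all(item in remaining for item in pattern)

def sequence_support(histories, pattern):
    """Fraction of customer histories that contain the pattern in order."""
    return sum(contains_in_order(h, pattern) for h in histories) / len(histories)

# Hypothetical purchase histories, one list per customer, oldest purchase first.
histories = [
    ["Shoes", "Racket", "Sports Jacket"],
    ["Shoes", "Socks", "Racket", "Sports Jacket"],
    ["Racket", "Shoes"],
    ["Sports Jacket", "Shoes", "Racket"],
]
pattern = ["Shoes", "Racket", "Sports Jacket"]
print(sequence_support(histories, pattern))
```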
Clustering (Hierarchical and K-Means): Application

Hierarchical clustering: clusters are formed at different levels by
merging clusters at a lower level.

K-means is a partitioning method rather than a hierarchical one. With
K=2:
• Arbitrarily choose K objects as initial cluster centers.
• Assign each of the objects to the most similar center.
• Update the cluster means.
• Reassign the objects and update the cluster means again, repeating
until no object changes cluster.

[Figure: scatter plots on 0 to 10 axes showing the K-means steps for K=2]
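The K-means loop from the figure (choose K initial centers, assign each object to the most similar center, update the cluster means, reassign) can be sketched in plain Python. This is a minimal two-dimensional version under stated assumptions: "most similar" is taken to mean smallest squared Euclidean distance, the arbitrary initial centers are simply the first K points, and the sample points are invented.

```python
def kmeans(points, k, max_iter=100):
    """Cluster 2-D points; returns (centers, assignment of each point to a center index)."""
    centers = list(points[:k])  # arbitrary initial choice: the first k objects
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the most similar (nearest) center.
        new_assignment = [
            min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the means are stable
        assignment = new_assignment
        # Update each cluster mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, assignment

# Hypothetical objects forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = kmeans(points, k=2)
print(assignment)
```

Even with both initial centers inside the same group, the reassignment step pulls the second center toward the far group after one update, which is the behavior the figure illustrates.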
Decision Tree Identification: Application

Decision Tree Identification Example

Outlook    Temp      Play?
Sunny      Warm      Yes
Overcast   Chilly    No
Sunny      Chilly    Yes
Cloudy     Pleasant  Yes
Overcast   Pleasant  Yes
Overcast   Chilly    No
Cloudy     Chilly    No
Cloudy     Warm      Yes

First split on Outlook:
Sunny    -> Yes
Cloudy   -> Yes/No
Overcast -> Yes/No
Decision Tree Identification: Application

Outlook (Yes/No)
  Sunny    -> Yes
  Cloudy   -> Yes/No, split on Temp
    Warm     -> Yes
    Pleasant -> Yes
    Chilly   -> No
  Overcast -> Yes/No, split on Temp
    Pleasant -> Yes
    Chilly   -> No
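The tree for the Outlook/Temp example can be rebuilt mechanically from the eight rows on the slide. The sketch below splits on Outlook first and then on Temp, the same order as the slide; it does not choose attributes by information gain or any other criterion, which a real induction algorithm would. The function names are illustrative.

```python
from collections import Counter

# The eight examples from the slide: (Outlook, Temp, Play?).
ROWS = [
    ("Sunny", "Warm", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"),
    ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"),
    ("Cloudy", "Warm", "Yes"),
]
ATTRS = ["Outlook", "Temp"]  # fixed split order, following the slide

def build(rows, depth=0):
    """Return a leaf label, or (attribute, {value: subtree}) for a split."""
    labels = {row[-1] for row in rows}
    if len(labels) == 1:
        return labels.pop()  # pure node: all examples agree
    if depth == len(ATTRS):
        return Counter(row[-1] for row in rows).most_common(1)[0][0]  # majority vote
    branches = {}
    for row in rows:
        branches.setdefault(row[depth], []).append(row)
    return (ATTRS[depth], {value: build(subset, depth + 1) for value, subset in branches.items()})

def classify(tree, example):
    """Follow the branches matching `example` until a leaf label is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[example[attr]]  # raises KeyError for values never seen in training
    return tree

tree = build(ROWS)
print(classify(tree, {"Outlook": "Cloudy", "Temp": "Chilly"}))
```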
Major Application Areas for Data
Mining (Classification)
• Advertising
• Bioinformatics
• Customer Relationship Management (CRM)
• Database Marketing
• Fraud Detection
• E-commerce
• Health Care
• Investment/Securities
• Manufacturing, Process Control
• Sports and Entertainment
• Telecommunications
• Web
Major Application Areas for Data
Mining: Marketing
• Direct Marketing:
Most major direct marketing companies are using
modeling and data mining.
• Customer segmentation:
All industries can take advantage of DM to discover
discrete segments in their customer bases by considering
additional variables beyond traditional analysis.
• CRM:
Find other people in similar life stages and determine
which customers are following similar behavior patterns
– Up-sell
– Cross-sell
– Keeping the customers for a longer period of time
E.g., Verizon Wireless reduced its churn rate from 2% to 1.5%.
Major Application Areas for Data
Mining: Fraud Detection

• Credit Card Fraud Detection


• Money laundering
– FAIS (US Treasury)
• Securities Fraud
– NASDAQ Sonar system
• Phone fraud
– AT&T, Bell Atlantic, British Telecom/MCI
• Bio-terrorism detection at Salt Lake
Olympics 2002
Major Application Areas for Data
Mining: Retail
• Sales forecasting:
Examining time-based patterns helps retailers make
stocking decisions.

• Database Retailing:
Retailers can develop profiles of customers with
certain behaviors, for example, those who purchase
designer-label clothing or those who attend sales.

• Merchandise planning and allocation:


When retailers add new stores, they can improve
merchandise planning and allocation by examining
patterns in stores with similar demographic
characteristics.
Major Application Areas for Data
Mining: Banking

• Credit Card marketing


By identifying customer segments, card
issuers and acquirers can improve
profitability with more effective acquisition
and retention programs.

• Cardholder pricing and profitability


Card issuers can take advantage of data
mining technology to price their products so
as to maximize profit and minimize loss of
customers.
Major Application Areas for Data
Mining: Telecommunication
• Call detail record analysis:
Telecommunication companies accumulate
detailed call records. By identifying customer
segments with similar use patterns, the
companies can develop attractive pricing and
feature promotions.

• Customer loyalty:
Some customers repeatedly switch providers, or
“churn”, to take advantage of attractive incentives
by competing companies. The companies can use
DM to identify the characteristics of customers
who are likely to remain loyal once they switch,
thus enabling the companies to target their
spending on customers who will produce the most
profit.
Major Application Areas for Data
Mining: Manufacturing

• Manufacturing:
Through choice boards, manufacturers are
beginning to customize products for
customers; therefore they must be able to
predict which features should be bundled to
meet customer demand.

• Warranties:
Manufacturers need to predict the number of
customers who will submit warranty claims
and the average cost of those claims.
Issues and Challenges
• Large data
– Number of variables (features), number of cases (examples)
– Multi gigabyte, terabyte databases
– Efficient algorithms, parallel processing
• High dimensionality
– Large number of features: exponential increase in search space
– Potential for spurious patterns
– Dimensionality reduction
• Overfitting
– Models noise in training data, rather than just the general patterns
• Changing data, missing and noisy data
• Use of domain knowledge
– Utilizing knowledge on complex data relationships, known facts
• Understandability of patterns
Success Stories

• Network intrusion detection using a combination of sequential rule
discovery and classification trees on 4 GB of DARPA data
– Won over (manual) knowledge engineering approach
– http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides
good detailed description of the entire process
• Major US bank: customer attrition prediction
– First segment customers based on financial behavior: found 3
segments
– Build attrition models for each of the 3 segments
– 40-50% of attritions were predicted == factor of 18 increase
• Targeted credit marketing: major US banks
– Find customer segments based on 13 months credit balances
– Build another response model based on surveys
– Increased response 4 times -- 2%
Amitava Manna (11DCP007)
Amritanshu Mehra (11DCP008)
Animesh Ranjan (11DCP009)
