Data Mining
process of extracting previously unknown, valid,
and actionable (understandable) information from
large databases
Data mining is the step in the KDD (knowledge discovery in databases)
process that applies data analysis and discovery algorithms
Machine learning, pattern recognition, statistics,
databases, data visualization.
Traditional techniques may be inadequate
large data
Affordable computing
Competitive pressure
gain an edge by providing improved, customized services
information as a product in its own right
[Figure: data mining process diagram. Labels: Operational Databases, Data Warehouse, Data Preparation, Training Data, Data Mining, Model, Patterns, Verification, Evaluation]
Search method
Data mining
[Figure: bar chart with a vertical axis from 0 to 100 over four phases: Business Objective Determination, Data Preparation, Data Mining, Analysis of Results and Knowledge Assimilation]
high dimensionality
Overfitting
models noise in training data, rather than just the general patterns
Understandability of patterns
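A minimal sketch of overfitting on hand-made toy data. All values and both models are illustrative assumptions, not taken from the slides. A model that memorizes the training set, noise included, is perfect on training data but does worse on new data than a simple rule that captures only the general pattern.

```python
# Toy (x, label) pairs, hypothetical. General pattern: label is 1 when x >= 5.
# Two training labels (x=3 and x=7) are deliberately flipped to act as noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
# Test points follow the general pattern with no noise.
test = [(1.4, 0), (2.9, 0), (3.2, 0), (6.8, 1), (7.3, 1), (8.6, 1)]


def memorizer(x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def threshold_rule(x):
    """Simple model that encodes only the general pattern."""
    return 1 if x >= 5 else 0


def accuracy(model, data):
    """Fraction of (x, label) pairs the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


for name, model in [("memorizer", memorizer), ("threshold rule", threshold_rule)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

The memorizer reproduces the two flipped labels, so every test point near x=3 or x=7 inherits the noise.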
Data Mining
Prediction Methods
using some variables to predict unknown or future values of
other variables
Descriptive Methods
finding human-interpretable patterns describing the data
Classification
Clustering
Association Rule Discovery
Sequential Pattern Discovery
Regression
Deviation Detection
Classification
Data defined in terms of attributes, one of which is the class
Classification: Example
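A minimal sketch of classification on records where one attribute is the class. The records, attribute names, and the single-attribute majority rule used here are illustrative assumptions, not the method or data from the slides.

```python
from collections import Counter, defaultdict


def train_one_rule(records, attribute, class_attr="class"):
    """Map each value of one attribute to the majority class seen with it."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[attribute]][record[class_attr]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


# Hypothetical training records: two attributes plus the class attribute.
records = [
    {"income": "high", "owns_home": "yes", "class": "responder"},
    {"income": "high", "owns_home": "no", "class": "responder"},
    {"income": "low", "owns_home": "yes", "class": "non-responder"},
    {"income": "low", "owns_home": "no", "class": "non-responder"},
    {"income": "high", "owns_home": "yes", "class": "responder"},
    {"income": "low", "owns_home": "yes", "class": "responder"},
]

rule = train_one_rule(records, "income")

# Classify a previously unseen record by looking up its attribute value.
new_record = {"income": "low", "owns_home": "no"}
predicted_class = rule[new_record["income"]]
print(rule, predicted_class)
```

A decision tree generalizes this idea by testing several attributes in sequence.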
Clustering
Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
data points in one cluster are more similar to one another
data points in separate clusters are less similar to one
another.
Similarity measures
Euclidean distance if attributes are continuous
Problem specific measures
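A minimal sketch of clustering with Euclidean distance. It uses k-means, which is one common choice and not necessarily the algorithm the slides have in mind; the points and starting centroids are hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid by Euclidean distance.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2-D data: two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels, centroids)
```

For non-continuous attributes, `math.dist` would be replaced by a problem-specific similarity measure.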
Association Rules: Application
Marketing and Sales Promotion:
Consider discovered rule:
{Bagels, } --> {Potato Chips}
Potato Chips as consequent: can be used to determine
what may be done to boost sales
Bagels as an antecedent: can be used to see which
products may be affected if bagels are discontinued
Can be used to see which products should be sold with
Bagels to promote sale of Potato Chips
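A minimal sketch of how a rule such as the one above can be scored. Support and confidence are the standard measures for association rules; the baskets below are hypothetical and the antecedent is reduced to Bagels alone for illustration.

```python
def support_and_confidence(baskets, antecedent, consequent):
    """Score the rule antecedent --> consequent over a list of item sets."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = sum(1 for b in baskets if antecedent <= b)
    with_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = with_both / len(baskets)  # share of all baskets holding both sides
    # Share of antecedent baskets that also hold the consequent.
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical market baskets.
baskets = [
    {"Bagels", "Potato Chips", "Milk"},
    {"Bagels", "Potato Chips"},
    {"Bagels", "Milk"},
    {"Potato Chips", "Soda"},
    {"Bagels", "Potato Chips", "Soda"},
]

support, confidence = support_and_confidence(baskets, {"Bagels"}, {"Potato Chips"})
print(support, confidence)
```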
Regression
Predict a value of a given continuous valued
variable (dependent variable) based on values of
other variables (independent variables)
Statistics, Neural networks, Genetic algorithms
Examples:
predicting sales volumes of new product based on
advertising expenditure
Time series prediction of stock market indices.
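A minimal sketch of regression for the sales example: an ordinary least-squares line through one independent variable. The advertising and sales figures are hypothetical, and a straight line is only one of the model families listed above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure (independent) and sales volume (dependent).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 6.0 + intercept  # predicted sales at a new spend level of 6.0
print(slope, intercept, predicted)
```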
Visualization
complement to other DM techniques like Segmentation, etc.
Hypothesis testing
transaction data may be insufficient
explore ideas about why customers might leave, and how to identify
e.g. Regular bi-weekly direct deposit ceases: new job and no longer using
direct deposits
got married and spouse used another bank: reduction in balance and
number of transactions, last-name change request
Data requirements
Careful attention to data generated by internal decisions:
bank started charging for debit card transactions that were free
bank turned down loan or credit increase request
Is the data available?
Clustering
unsupervised
HELOC customers
DDA customers (~250K cases)
Example
Data
DDA history of loan balances over 3,6,9,12,18 months,
returned checks
demographic data (age, income, length of residence, etc.),
both internal and external
property data sourced externally (home purchase price,
loan-to-value ratio, etc.)
credit worthiness data
response to previous mailings
120 variables selected
less than half the DDAs had history records; missing fields;
(45K cases remaining for use -- prospects database)
exclude variables like sex, race, age (legal restrictions)
Example
Training data
randomly sample from prospects database; weighted to
include more responders than present in actual data
Validation
rank on likelihood of response
consider top and bottom 10% -- use visualization, decision
tree to understand rationale for obtained classification
Testing
sample from prospects database; unweighted with normal
proportion of responders and non-responders
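A minimal sketch of the two sampling schemes described above: a training sample weighted toward responders, and an unweighted test sample. The prospect records, the 50% responder share, and sampling with replacement are assumptions made for illustration.

```python
import random


def weighted_training_sample(prospects, size, responder_share, seed=0):
    """Draw a sample in which responders make up a chosen share."""
    rng = random.Random(seed)
    responders = [p for p in prospects if p["responded"]]
    non_responders = [p for p in prospects if not p["responded"]]
    n_responders = round(size * responder_share)
    # Sampling with replacement, because responders are rare (an assumption).
    sample = rng.choices(responders, k=n_responders) + rng.choices(
        non_responders, k=size - n_responders
    )
    rng.shuffle(sample)
    return sample


def unweighted_test_sample(prospects, size, seed=1):
    """Simple random sample; keeps the natural proportion on average."""
    return random.Random(seed).sample(prospects, size)


# Hypothetical prospects database: 1,000 records, 5% responders.
prospects = [{"id": i, "responded": i % 20 == 0} for i in range(1000)]

training = weighted_training_sample(prospects, size=200, responder_share=0.5)
testing = unweighted_test_sample(prospects, size=200)
print(sum(p["responded"] for p in training), sum(p["responded"] for p in testing))
```

A real split would also keep test records out of the training sample, which this sketch does not enforce.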
gains (lift) chart
Number of   Number of   Decile          Cum             Cum
Customers   Responses   Response Rate   Response Rate   Response Lift
4,617       865         18.7%           18.7%           411
4,617       382         8.3%            13.5%           296
4,617       290         6.3%            11.1%           244
4,617       128         2.8%            9.0%            198
4,617       97          2.1%            7.6%            167
4,617       81          1.8%            6.7%            146
4,617       79          1.7%            5.9%            130
4,617       72          1.6%            5.4%            118
4,617       67          1.5%            5.0%            109
4,617       43          0.9%            4.6%            100
46,170      2,104       4.6%
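A minimal sketch of how the gains chart figures can be derived from the response counts. It assumes lift is 100 times the cumulative response rate divided by the overall response rate, which appears consistent with the values in the chart; the function and field names are illustrative.

```python
# Responses per decile, copied from the gains chart above.
responses = [865, 382, 290, 128, 97, 81, 79, 72, 67, 43]
customers_per_decile = 4617  # every decile holds the same number of customers


def gains_table(responses, customers_per_decile):
    """Per-decile rate, cumulative rate, and cumulative lift (baseline = 100)."""
    total_customers = customers_per_decile * len(responses)
    overall_rate = sum(responses) / total_customers
    rows = []
    cum_responses = 0
    for decile, n_responses in enumerate(responses, start=1):
        cum_responses += n_responses
        cum_rate = cum_responses / (customers_per_decile * decile)
        rows.append(
            {
                "decile": decile,
                "rate": n_responses / customers_per_decile,
                "cum_rate": cum_rate,
                "lift": round(100 * cum_rate / overall_rate),
            }
        )
    return rows


rows = gains_table(responses, customers_per_decile)
for row in rows:
    print(f'{row["decile"]:>2}  {row["rate"]:6.1%}  {row["cum_rate"]:6.1%}  {row["lift"]:>4}')
```

A lift of 411 in the first row means that mailing only the top-ranked tenth of prospects reaches responders at roughly four times the rate of mailing at random.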