
Data Mining
Some slide material taken from: Groth, Han and Kamber, SAS Institute
The UNT/SAS® joint Data Mining
Certificate: New in 2006
• Just approved!
• Free of charge!
• Requires:
– DSCI 2710
– DSCI 3710
– BCIS 4660
– DSCI 4520
Overview of this Presentation

• Introduction to Data Mining


• The SEMMA Methodology
• Regression/Logistic Regression
• Decision Trees
• Neural Networks
• SAS EM Demo: The Home Equity Loan Case

Important DM techniques Not Covered today:


• Market Basket Analysis
• Memory-Based Reasoning
• Web Link Analysis
Introduction to DM

“It is a capital mistake to theorize


before one has data.
Insensibly one begins to twist facts
to suit theories, instead of
theories to suit facts.”

(Sir Arthur Conan Doyle: Sherlock Holmes, "A Scandal in Bohemia")
What Is Data Mining?

• Data mining (knowledge discovery in


databases):
– A process of identifying hidden patterns
and relationships within data (Groth)
• Data mining:
– Extraction of interesting (non-trivial,
implicit, previously unknown and
potentially useful) information or
patterns from data in large databases
[Figure: examples of large operational data sources: stock trades, hospital patient registries, electronic point-of-sale data, catalog orders, OLTP systems, airline reservations, bank teller data, telephone calls, tax returns, credit card charges]
DM and Business Decision Support
– Database Marketing
• Target marketing
• Customer relationship management
– Credit Risk Management
• Credit scoring
– Fraud Detection
– Healthcare Informatics
• Clinical decision support
Multidisciplinary
[Diagram: data mining at the intersection of Statistics, Pattern Recognition, Neurocomputing, Machine Learning, AI, Databases, and KDD]
On the News:
Data-mining software digs for business leads

SAN FRANCISCO, March 8, 2004. Last year,
Congress shut down the Pentagon's Total Information
Awareness program. Now entrepreneurs are selling new
data-mining tools that do similar things on a smaller scale.
Spoke and Visible Path sell their software primarily to
corporations. The idea is to provide tools for finding helpful
business partners and making blind introductions --
allowing, say, a lawyer for Silly Computers Inc. to
electronically ask a former classmate from Harvard who
once did legal work for Microsoft to help him pitch a
business deal to Bill Gates.

How does it work? Both Spoke and Visible Path send so-called crawlers around a corporation's
internal computer network -- sniffing telltale clues, say, from employee Outlook files about who
they e-mail and how often, who replies to particular messages and who doesn't, which names show up
in electronic calendars and phone logs. Then it cross-references those snippets with information
from other company databases, including sales records from PeopleSoft and Salesforce.com.
Data Mining: A KDD Process

– Data mining: the core of the knowledge discovery process.

[Diagram: Databases → Data Integration / Data Cleaning → Data Warehouse → Data Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining and Business Intelligence

[Pyramid diagram: increasing potential to support business decisions, bottom to top:
Data Sources: paper, files, information providers, database systems, OLTP
Data Warehouses / Data Marts: OLAP, MDA (DBA)
Data Exploration: statistical analysis, querying and reporting
Data Mining: information discovery (Data Analyst)
Data Presentation: visualization techniques (Business Analyst)
Decision Making (End User / Manager)]
Architecture of a Typical Data
Mining System

[Diagram, top to bottom: graphical user interface; pattern evaluation; data mining engine, consulting a knowledge base; database or data warehouse server; data cleaning, data integration, and filtering; underlying databases and data warehouse]
Introducing
SAS Enterprise Miner (EM)
The SEMMA Methodology

– Introduced By SAS Institute


– Implemented in SAS Enterprise Miner (EM)
– Organizes a DM effort into 5 activity groups:
Sample
Explore
Modify
Model
Assess
Sample

Input Data Source

Sampling

Data Partition
Explore

Distribution Explorer
Multiplot
Insight
Association
Variable Selection
Link Analysis
Modify

Data Set Attributes
Transform Variables
Filter Outliers
Replacement
Clustering
Self-Organized Maps / Kohonen Networks
Time Series
Model

Regression
Tree
Neural Network
Princomp/Dmneural
User Defined Model
Ensemble
Memory Based Reasoning
Two-Stage Model
Assess

Assessment

Reporter
Other Types of Nodes – Scoring
Nodes, Utility Nodes
Group Processing

Data Mining Database


Score
SAS Code
C*Score
Control Point

Subdiagram
DATA MINING AT WORK:
Detecting Credit Card Fraud

– Credit card companies want a way to monitor
new transactions and detect those made on
stolen credit cards. Their goal is to detect the
fraud while it is taking place.
– Within a few weeks after each transaction,
they will know which transactions were
fraudulent and which were not; they can then
use this data to validate their fraud detection
and prediction scheme.
DATA MINING AT WORK:
Telstra Mobile Combats Churn with SAS®
As Australia's largest mobile service provider, Telstra Mobile is reliant
on highly effective churn management.
In most industries the cost of retaining a customer, subscriber or client
is substantially less than the initial cost of obtaining that customer.
Protecting this investment is the essence of churn management. It
really boils down to understanding customers -- what they want now
and what they're likely to want in the future, according to SAS.
"With SAS Enterprise Miner we can examine
customer behaviour on historical and
predictive levels, which can then show us
what 'group' of customers are likely to churn
and the causes," says Trish Berendsen,
Telstra Mobile's head of Customer
Relationship Management (CRM).
DATA MINING AT WORK:
Reducing armed robberies in South Africa
SAS helped Absa, a major South African bank, reduce armed
robberies by 41 percent over two years (2002-2003), netting a 38
percent reduction in cash loss and an 11 percent increase in
customer satisfaction ratings.
Absa, one of South Africa's largest banks, uses SAS' data mining
capabilities to leverage their data for better customer relationships
and more targeted marketing campaigns. With SAS analytics, the
bank can also track which branches are more likely to fall victim to a
robbery and take effective preventive measures.

"Absa used to be one of the banks that was


targeted more than other banks; now we're at the
bottom of the list," says Dave Donkin, Absa group
executive of e-business and information
management.
DATA MINING AT WORK:
Strategic Pricing Solutions at MCI
MCI now has a solution for making strategic pricing decisions, driving
effective network analysis, enhancing segment reporting and creating
data for sales leader compensation.
Before implementing SAS, the process of inventorying MCI's
thousands of network platforms and IT systems – determining what
each one does, who runs them, how they help business and which
products they support – was completely manual. The model created
with SAS has helped MCI to catalog all that information and map the
details to products, customer segments and business processes.
"That's something everyone is excited about,"
says Leslie Mote, director of MCI corporate
business analysis. "Looking at the cost of a
system and what it relates to helps you see the
revenue you're generating from particular
products or customers. I can see what I'm doing
better."
Our own example:
The Home Equity Loan Case

• HMEQ Overview
• Determine who should be
approved for a home equity loan.
• The target variable is a binary
variable that indicates whether
an applicant eventually defaulted
on the loan.
• The input variables are variables
such as the amount of the loan,
amount due on the existing
mortgage, the value of the
property, and the number of
recent credit inquiries.
HMEQ case overview
– The consumer credit department of a bank wants to automate
the decision-making process for approval of home equity lines
of credit. To do this, they will follow the recommendations of
the Equal Credit Opportunity Act to create an empirically
derived and statistically sound credit scoring model. The model
will be based on data collected from recent applicants granted
credit through the current process of loan underwriting. The
model will be built from predictive modeling tools, but the
created model must be sufficiently interpretable so as to
provide a reason for any adverse actions (rejections).
– The HMEQ data set contains baseline and loan performance
information for 5,960 recent home equity loans. The target
(BAD) is a binary variable that indicates if an applicant
eventually defaulted or was seriously delinquent. This adverse
outcome occurred in 1,189 cases (20%). For each applicant, 12
input variables were recorded.
The HMEQ Loan process
1. An applicant comes forward with a specific
property and a reason for the loan (Home-Improvement,
Debt-Consolidation)
2. Background info related to job and credit
history is collected
3. The loan gets approved or rejected
4. Upon approval, the Applicant becomes a
Customer
5. Information related to how the loan is serviced
is maintained, including the Status of the loan
(Current, Delinquent, Defaulted, Paid-Off)
The HMEQ Loan
Transactional Database
• Entity Relationship Diagram (ERD), Logical Design:

[ERD: APPLICANT applies for an HMEQ Loan on a PROPERTY using an OFFICER (relationship attributes: Loan, Reason, Date, Approval); upon approval the APPLICANT becomes a CUSTOMER; a CUSTOMER has an ACCOUNT (Balance, Status, MonthlyPayment); an ACCOUNT has a HISTORY]
HMEQ Transactional database:
the relations
• Entity Relationship Diagram (ERD), Physical Design:

Officer: OFFICERID, OFFICERNAME, PHONE, FAX
HMEQLoanApplication: OFFICERID, APPLICANTID, PROPERTYID, LOAN, REASON, DATE, APPROVAL
Property: PROPERTYID, ADDRESS, VALUE, MORTDUE
Applicant: APPLICANTID, NAME, JOB, DEBTINC, YOJ, DEROG, CLNO, DELINQ, CLAGE, NINQ
Customer: CUSTOMERID, APPLICANTID, NAME, ADDRESS
Account: ACCOUNTID, CUSTOMERID, PROPERTYID, BALANCE, MONTHLYPAYMENT
History: HISTORYID, ACCOUNTID, PAYMENT, DATE, STATUS
The HMEQ Loan
Data Warehouse Design
• We have some slowly changing attributes:
HMEQLoanApplication: Loan, Reason, Date
Applicant: Job and Credit Score related attributes
Property: Value, Mortgage, Balance
• An applicant may reapply for a loan, by which
time some of these attributes may have
changed.
– Need to introduce “Key” attributes and make
them primary keys
The HMEQ Loan
Data Warehouse Design
STAR 1 – Loan Application facts
• Fact Table: HMEQApplicationFact
• Dimensions: Applicant, Property, Officer, Time

STAR 2 – Loan Payment facts


• Fact Table: HMEQPaymentFact
• Dimensions: Customer, Property, Account, Time
Two Star Schemas for HMEQ Loans
Applicant (dimension): APPLICANTKEY, APPLICANTID, NAME, JOB, DEBTINC, YOJ, DEROG, CLNO, DELINQ, CLAGE, NINQ
Property (dimension): PROPERTYKEY, PROPERTYID, ADDRESS, VALUE, MORTDUE
Officer (dimension): OFFICERKEY, OFFICERID, OFFICERNAME, PHONE, FAX
Customer (dimension): CUSTOMERKEY, CUSTOMERID, APPLICANTID, NAME, ADDRESS
Account (dimension): ACCOUNTKEY, LOAN, MATURITYDATE, MONTHLYPAYMENT
Time (dimension): TIMEKEY, DATE, MONTH, YEAR
HMEQApplicationFact: APPLICANTKEY, PROPERTYKEY, OFFICERKEY, TIMEKEY, LOAN, REASON, APPROVAL
HMEQPaymentFact: CUSTOMERKEY, PROPERTYKEY, ACCOUNTKEY, TIMEKEY, BALANCE, PAYMENT, STATUS
The HMEQ Loan DW:
Questions asked by management
• How many applications were filed each month during the
last year? What percentage of them were approved each
month?
• How has the monthly average loan amount been
fluctuating during the last year? Is there a trend?
• Which customers were delinquent in their loan payment
during the month of September?
• How many loans have defaulted each month during the
last year? Is there an increasing or decreasing trend?
• How many defaulting loans were approved last year by
each loan officer? Who are the officers with the largest
number of defaulting loans?
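Questions like these reduce to simple aggregations over the application fact table. A minimal sketch in Python (stdlib only; the records and field layout below are hypothetical illustrations, not the actual warehouse schema):

```python
from collections import defaultdict

# Hypothetical application fact rows: (month, approved)
applications = [
    ("2004-01", True), ("2004-01", False), ("2004-01", True),
    ("2004-02", True), ("2004-02", False),
]

# Count applications filed and approved per month
counts = defaultdict(lambda: {"filed": 0, "approved": 0})
for month, approved in applications:
    counts[month]["filed"] += 1
    counts[month]["approved"] += int(approved)

# Report the approval percentage for each month
for month in sorted(counts):
    c = counts[month]
    pct = 100.0 * c["approved"] / c["filed"]
    print(month, c["filed"], f"{pct:.1f}%")
```

The other questions (average loan amount per month, delinquent customers in September, defaults per officer) follow the same group-then-aggregate pattern over the two fact tables.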
The HMEQ Loan DW:
Some more involved questions
• Are there any patterns suggesting which applicants are
more likely to default on their loan after it is approved?
• Can we relate loan defaults to applicant job and credit
history? Can we estimate probabilities of default based
on applicant attributes at the time of application? Are
there applicant segments with higher probability?
• Can we look at relevant data and build a predictive
model that estimates the probability of default on the
HMEQ loan? If we make such a model part of our
business policy, can we decrease the percentage of
loans that eventually default by applying more stringent
loan approval criteria?
Selecting Task-relevant attributes
[Diagram: the two HMEQ star schemas again, with the task-relevant attributes selected: applicant inputs (JOB, DEBTINC, YOJ, DEROG, CLNO, DELINQ, CLAGE, NINQ), property inputs (VALUE, MORTDUE), application facts (LOAN, REASON), and the payment STATUS used to derive the target]
HMEQ final task-relevant data file

Name     Model Role  Measurement Level  Description
BAD      Target      Binary             1=defaulted on loan, 0=paid back loan
REASON   Input       Binary             HomeImp=home improvement, DebtCon=debt consolidation
JOB      Input       Nominal            Six occupational categories
LOAN     Input       Interval           Amount of loan request
MORTDUE  Input       Interval           Amount due on existing mortgage
VALUE    Input       Interval           Value of current property
DEBTINC  Input       Interval           Debt-to-income ratio
YOJ      Input       Interval           Years at present job
DEROG    Input       Interval           Number of major derogatory reports
CLNO     Input       Interval           Number of trade lines
DELINQ   Input       Interval           Number of delinquent trade lines
CLAGE    Input       Interval           Age of oldest trade line in months
NINQ     Input       Interval           Number of recent credit inquiries
HMEQ: Modeling Goal
– The credit scoring model should compute the
probability that a given loan applicant will default
on loan repayment. A threshold is to be
selected such that all applicants whose
probability of default exceeds the threshold
are recommended for rejection.
– Using the HMEQ task-relevant data file, three
competing models will be built: a Logistic
Regression model, a Decision Tree, and a
Neural Network.
– Model assessment will allow us to select the
best of the three alternative models.
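The threshold rule in this modeling goal can be sketched in a few lines of Python. The scores and cutoff below are hypothetical; in practice the threshold is chosen during model assessment:

```python
# Hypothetical predicted default probabilities for five applicants
scores = {"A1": 0.05, "A2": 0.32, "A3": 0.18, "A4": 0.61, "A5": 0.09}
threshold = 0.25  # assumed cutoff, chosen via model assessment in practice

# Recommend rejection when the predicted probability of default
# exceeds the threshold; otherwise recommend approval
decisions = {a: ("reject" if p > threshold else "approve")
             for a, p in scores.items()}
```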
Predictive Modeling

[Diagram: a rectangular training data table: rows are cases, columns are inputs, plus one target column]
Modeling Tools

Logistic Regression
Modeling Techniques:
Separate Sampling


Benefits:
• Helps detect rare target levels
• Speeds processing
Risks:
• Biases predictions (correctable)
• Increases prediction variability
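The "correctable bias" noted above is usually handled with a prior-correction step: probabilities estimated on the separately sampled (oversampled) training data are mapped back to the population event rate. A sketch of the standard correction formula, assuming both rates are known:

```python
def correct_posterior(p_sample, pop_rate, sample_rate):
    """Map a probability estimated on oversampled training data back to
    the population prior: rescale the odds of each class by the ratio of
    population rate to sample rate, then renormalize."""
    a = p_sample * pop_rate / sample_rate
    b = (1 - p_sample) * (1 - pop_rate) / (1 - sample_rate)
    return a / (a + b)
```

For example, a model trained on a 50/50 separate sample that outputs 0.5 for a case is really saying "no more likely than average", so the corrected probability equals the population rate.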
Logistic Regression Models

logit(p) = log(odds) = log( p / (1 - p) ) = w0 + w1x1 + … + wpxp

[Plot: the logistic curve: p rises from 0.0 through 0.5 to 1.0 as logit(p) sweeps the real line, with p = 0.5 at logit(p) = 0]
Training Data
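The logit link and its inverse translate directly into code. A minimal sketch (the weights and inputs below are hypothetical):

```python
import math

def logit(p):
    """logit(p) = log(odds) = log(p / (1 - p))"""
    return math.log(p / (1 - p))

def logistic(z):
    """Inverse of the logit: maps any real score back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Linear predictor for one case: w0 + w1*x1 + ... + wp*xp
w = [-2.0, 0.04, 0.8]   # hypothetical weights (intercept, x1, x2)
x = [1.0, 45.0, 0.5]    # constant 1 for the intercept, plus two inputs
score = sum(wi * xi for wi, xi in zip(w, x))
p = logistic(score)     # predicted probability of the target event
```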
Changing the Odds

log( p / (1 - p) ) = w0 + w1x1 + … + wpxp

log( p′ / (1 - p′) ) = w0 + w1(x1 + 1) + … + wpxp

so the odds ratio for a one-unit increase in x1 is

odds(p′) / odds(p) = exp(w1)
Training Data
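The odds-ratio identity above is easy to verify numerically: increasing x1 by one unit multiplies the odds by exp(w1), whatever the starting value of x1. A sketch with hypothetical weights:

```python
import math

w0, w1 = -1.5, 0.7   # hypothetical logistic regression weights

def odds(x1):
    """Odds implied by the model at input value x1 (other inputs fixed)."""
    return math.exp(w0 + w1 * x1)

# The ratio is exp(w1) regardless of where the one-unit step starts
ratio = odds(3.0) / odds(2.0)
```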
Modeling Tools

Decision Trees
Divide and Conquer the
HMEQ data

[Tree diagram: root node n = 5,000 with 10% BAD; binary split on Debt-to-Income Ratio < 45; left child (yes) n = 3,350 with 5% BAD; right child (no) n = 1,650 with 21% BAD]

The tree is fitted to the data by recursive partitioning.
Partitioning refers to segmenting the data into subgroups that
are as homogeneous as possible with respect to the target. In
this case, the binary split (Debt-to-Income Ratio < 45) was
chosen. The 5,000 cases were split into two groups, one with
a 5% BAD rate and the other with a 21% BAD rate.

The method is recursive because each subgroup results from splitting a subgroup
from a previous split. Thus, the 3,350 cases in the left child node and the 1,650
cases in the right child node are split again in similar fashion.
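The split search behind recursive partitioning can be sketched as an exhaustive scan over candidate thresholds, scoring each by the weighted impurity of the resulting children (here the Gini index; the toy data values are made up to echo the slide's split):

```python
def bad_rate(rows):
    """Proportion of BAD cases in a node; rows are (x, y) pairs."""
    return sum(y for _, y in rows) / len(rows)

def gini(rows):
    """Gini impurity of a binary node: 2 * p * (1 - p)."""
    p = bad_rate(rows)
    return 2 * p * (1 - p)

def best_split(rows):
    """Exhaustive search for the split x < t with lowest weighted Gini."""
    best = (None, float("inf"))
    for t in sorted({x for x, _ in rows}):
        left = [r for r in rows if r[0] < t]
        right = [r for r in rows if r[0] >= t]
        if not left or not right:
            continue  # skip degenerate splits
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if w < best[1]:
            best = (t, w)
    return best

# Toy data: (debt-to-income ratio, BAD flag); values are made up
data = [(20, 0), (25, 0), (30, 0), (35, 0),
        (44, 0), (45, 1), (50, 1), (60, 1)]
threshold, impurity = best_split(data)
```

On this toy data the search recovers the slide's style of split (Ratio < 45); a real tree then recurses on each child node.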
The Cultivation of Trees
– Split Search
• Which splits are to be considered?
– Splitting Criterion
• Which split is best?
– Stopping Rule
• When should the splitting stop?
– Pruning Rule
• Should some branches be lopped off?
Possible Splits to Consider:
an enormous number

[Chart: number of possible splits versus number of input levels (2 to 20); for a nominal input the count grows into the hundreds of thousands, far faster than for an ordinal input]
Splitting Criteria

How is the best split determined? In some situations, the worth


of a split is obvious. If the expected target is the same in the
child nodes as in the parent node, no improvement was made,
and the split is worthless!

In contrast, if a split results in pure child nodes, the split is


undisputedly best. For classification trees, the three most
widely used splitting criteria are based on the Pearson
chi-squared test, the Gini index, and entropy. All three measure
the difference in class distributions across the child nodes. The
three methods usually give similar results.
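The Gini index and entropy criteria can be computed directly from the class proportions in a node, for example:

```python
import math

def gini(class_probs):
    """Gini index: 1 minus the sum of squared class proportions.
    0 for a pure node; 0.5 for a 50/50 binary node."""
    return 1 - sum(p * p for p in class_probs)

def entropy(class_probs):
    """Entropy in bits: 0 for a pure node; 1 for a 50/50 binary node."""
    return -sum(p * math.log2(p) for p in class_probs if p > 0)
```

Both measures are zero exactly when a node is pure, which is why a split into pure child nodes is "undisputedly best" under either criterion.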
Benefits of Trees

– Interpretability
• tree-structured presentation
– Mixed Measurement Scales
• nominal, ordinal, interval
– Robustness (tolerance to noise)
– Handling of Missing Values
– Regression trees, Consolidation
trees
Modeling Tools

Neural Networks
Neural network models
(multi-layer perceptrons)
Often regarded as a mysterious and powerful predictive
modeling technique.
The most typical form of the model is, in fact, a natural
extension of a regression model:
• A generalized linear model on a set of derived inputs
• These derived inputs are themselves a generalized linear model
on the original inputs
The usual link for the derived inputs' model is the inverse
hyperbolic tangent, a shift and rescaling of the logit
function.
Ability to approximate virtually any continuous
association between the inputs and the target
• You simply need to specify the correct number of derived inputs
Neural Network Model

log( p / (1 - p) ) = w00 + w01H1 + w02H2 + w03H3

tanh-1( H1 ) = w10 + w11x1 + w12x2
tanh-1( H2 ) = w20 + w21x1 + w22x2
tanh-1( H3 ) = w30 + w31x1 + w32x2

[Plot: tanh(x) rises from -1 to 1, passing through 0 at x = 0]
Training Data
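A single forward pass of the network defined by these equations can be sketched as follows (the weights are hypothetical; each hidden unit is the tanh of a linear combination of the inputs, and the output layer is a logistic regression on the hidden units):

```python
import math

def logistic(z):
    return 1 / (1 + math.exp(-z))

def forward(x1, x2, W_hidden, w_out):
    """Forward pass of the 2-input, 3-hidden-unit model on the slide."""
    # Hidden layer: H_j = tanh(w_j0 + w_j1*x1 + w_j2*x2)
    H = [math.tanh(w0 + w1 * x1 + w2 * x2) for (w0, w1, w2) in W_hidden]
    # Output layer: logistic regression on the derived inputs H
    score = w_out[0] + sum(w * h for w, h in zip(w_out[1:], H))
    return logistic(score)

# Hypothetical weights (w10..w32 and w00..w03 in the slide's notation)
W_hidden = [(0.1, 0.5, -0.3), (-0.2, 0.4, 0.7), (0.3, -0.6, 0.2)]
w_out = (0.0, 1.2, -0.8, 0.5)
p = forward(1.0, 2.0, W_hidden, w_out)
```

Training consists of searching for the weight values that optimize an objective function, as described below.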
Input layer, hidden layer, output
layer

Multi-layer perceptron models were originally inspired by


neurophysiology and the interconnections between
neurons. The basic model form arranges neurons in
layers.
The input layer connects to a layer of neurons called a
hidden layer, which, in turn, connects to a final layer
called the target, or output, layer.
The structure of a multi-layer perceptron lends itself to a
graphical representation called a network diagram.
Neural Network Diagram

log( p / (1 - p) ) = w00 + w01H1 + w02H2 + w03H3

tanh-1( H1 ) = w10 + w11x1 + w12x2
tanh-1( H2 ) = w20 + w21x1 + w22x2
tanh-1( H3 ) = w30 + w31x1 + w32x2

[Network diagram: inputs x1 and x2 feed the hidden-layer units H1, H2, H3, which feed the output p]

Training Data
Objective Function
Predictions are compared to the actual values of the target
via an objective function.
An easy-to-understand example of an objective function is
the mean squared error (MSE) given by:
MSE = (1/N) Σ ( yi - ŷi(ŵ) )²   (summed over the training cases)

Where:
– N is the number of training cases.
– yi is the target value of the ith case.
– ŷi is the predicted target value.
– ŵ is the current estimate of the model parameters.
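The MSE formula translates directly into code:

```python
def mse(targets, predictions):
    """Mean squared error over the N training cases:
    (1/N) * sum of (y_i - yhat_i)^2."""
    n = len(targets)
    return sum((y - yhat) ** 2 for y, yhat in zip(targets, predictions)) / n
```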
Overgeneralization

A small value for the objective function, when calculated


on training data, need not imply a small value for the
function on validation data.
Typically, improvement on the objective function is
observed on both the training and the validation data
over the first few iterations of the training process.
At convergence, however, the model is likely to be highly
overgeneralized and the values of the objective
function computed on training and validation data may
be quite different.
Training Overgeneralization

[Plot: the objective function, evaluated on training and validation data, versus training iteration (0 to 70); the two curves diverge as training proceeds]

Training Data
Final Model

To compensate for overgeneralization, the overall average


profit, computed on validation data, is examined.
The final parameter estimates for the model are taken
from the training iteration with the maximum validation
profit.
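Selecting the final model this way amounts to an argmax over the training history: keep the weights from whichever iteration scored best on validation data. A sketch (the profit values below are made up):

```python
# Hypothetical per-iteration validation profit from a training run;
# training error keeps falling, but validation profit peaks and declines
validation_profit = [0.10, 0.18, 0.24, 0.27, 0.26, 0.22, 0.19]

# Final parameter estimates come from the iteration with the maximum
# validation profit, not from the final (converged) iteration
best_iteration = max(range(len(validation_profit)),
                     key=validation_profit.__getitem__)
```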
Neural Network Final Model

[Plot: validation profit versus training iteration (0 to 70); the final weights are taken at the iteration of maximum validation profit]

Training Data
