Data Mining
Some slide material taken from: Groth, Han and Kamber, SAS Institute
The UNT/SAS® joint Data Mining
Certificate: New in 2006
• Just approved!
• Free of charge!
• Requires:
– DSCI 2710
– DSCI 3710
– BCIS 4660
– DSCI 4520
Overview of this Presentation
DM and Business Decision Support
– Database Marketing
• Target marketing
• Customer relationship management
– Credit Risk Management
• Credit scoring
– Fraud Detection
– Healthcare Informatics
• Clinical decision support
Multidisciplinary
Data mining draws on many fields: Statistics, Pattern Recognition, Neurocomputing, Machine Learning, Artificial Intelligence (AI), Databases, and Knowledge Discovery in Databases (KDD).
On the News:
Data-mining software digs for business leads
How does it work? Both Spoke and Visible Path send so-called crawlers around a corporation's
internal computer network -- sniffing telltale clues, say, from employee Outlook files about who
they e-mail and how often, who replies to particular messages and who doesn't, which names show up
in electronic calendars and phone logs. Then it cross-references those snippets with information
from other company databases, including sales records from PeopleSoft and Salesforce.com.
Data Mining: A KDD Process
– Data mining is the core of the knowledge discovery process.
– The KDD pipeline, from sources to patterns:
Databases → (Data Cleaning, Data Integration) → Data Warehouse → (Data Selection) → Task-relevant Data → Data Mining → Pattern Evaluation
Data Mining and Business Intelligence
Layers, with increasing potential to support business decisions toward the top:
– Decision Making: End User (Manager)
– Pattern evaluation
– Data Exploration: Statistical Analysis, Querying and Reporting
– Databases / Data Warehouse
Introducing
SAS Enterprise Miner (EM)
The SEMMA Methodology
– Sample: Data Partition
– Explore: Distribution Explorer, Association
– Modify: Data Set Attributes, Clustering, Transform Variables, Self-Organized Maps / Kohonen Networks, Filter Outliers, Time Series, Replacement
– Model: Regression, User Defined Model, Tree, Ensemble
– Assess: Assessment, Reporter
Other Types of Nodes: Scoring Nodes, Utility Nodes (Group Processing, Subdiagram)
DATA MINING AT WORK:
Detecting Credit Card Fraud
• HMEQ Overview
• Goal: determine who should be approved for a home equity loan.
• The target variable is binary, indicating whether an applicant eventually defaulted on the loan.
• The input variables include the amount of the loan, the amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.
HMEQ case overview
– The consumer credit department of a bank wants to automate
the decision-making process for approval of home equity lines
of credit. To do this, they will follow the recommendations of
the Equal Credit Opportunity Act to create an empirically
derived and statistically sound credit scoring model. The model
will be based on data collected from recent applicants granted
credit through the current process of loan underwriting. The
model will be built from predictive modeling tools, but the
created model must be sufficiently interpretable so as to
provide a reason for any adverse actions (rejections).
– The HMEQ data set contains baseline and loan performance
information for 5,960 recent home equity loans. The target
(BAD) is a binary variable that indicates if an applicant
eventually defaulted or was seriously delinquent. This adverse
outcome occurred in 1,189 cases (20%). For each applicant, 12
input variables were recorded.
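The class balance quoted above can be checked directly from the counts given in the case overview; a quick sketch (variable names are illustrative):

```python
# HMEQ target summary, using the counts from the case overview.
total_loans = 5960   # recent home equity loans
bad_loans = 1189     # defaulted or seriously delinquent (BAD = 1)

bad_rate = bad_loans / total_loans
print(f"Adverse outcome rate: {bad_rate:.1%}")  # Adverse outcome rate: 19.9%
```

This matches the "roughly 20%" figure stated in the overview.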
The HMEQ Loan process
1. An applicant comes forward with a specific
property and a reason for the loan (Home-
Improvement, Debt-Consolidation)
2. Background info related to job and credit
history is collected
3. The loan gets approved or rejected
4. Upon approval, the Applicant becomes a
Customer
5. Information related to how the loan is serviced
is maintained, including the Status of the loan
(Current, Delinquent, Defaulted, Paid-Off)
The HMEQ Loan Transactional Database
• Entity Relationship Diagram (ERD), Logical Design:
– An APPLICANT applies for an HMEQ Loan on a PROPERTY, using an OFFICER; the application carries Loan, Reason, Date, and Approval.
– Upon approval the APPLICANT becomes a CUSTOMER, who holds an ACCOUNT (Balance, Status, MonthlyPayment); the ACCOUNT has a HISTORY.
HMEQ Transactional database: the relations
• Entity Relationship Diagram (ERD), Physical Design:
– Officer (OFFICERID, OFFICERNAME, PHONE, FAX, ADDRESS)
– HMEQLoanApplication (OFFICERID, APPLICANTID, PROPERTYID, LOAN, REASON, DATE, APPROVAL)
– Property (PROPERTYID, ADDRESS, VALUE, MORTDUE)
– Applicant (APPLICANTID, NAME, JOB, DEBTINC, YOJ, DEROG, CLNO, DELINQ, CLAGE, NINQ)
– Customer (CUSTOMERID, APPLICANTID, NAME, ADDRESS)
– Account (ACCOUNTID, CUSTOMERID, PROPERTYID, BALANCE, MONTHLYPAYMENT, STATUS)
– History (HISTORYID, ACCOUNTID, PAYMENT, DATE)
The HMEQ Loan
Data Warehouse Design
• We have some slowly changing attributes:
HMEQLoanApplication: Loan, Reason, Date
Applicant: Job and Credit Score related attributes
Property: Value, Mortgage, Balance
• An applicant may reapply for a loan, by which time some of these attributes may have changed.
– We need to introduce surrogate “Key” attributes and make them the primary keys.
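One common way to handle such slowly changing attributes is a Type 2 dimension: each re-application gets a new surrogate key, while the natural ApplicantID stays the same across versions. A minimal sketch (field names and values are illustrative, not taken from the case data):

```python
# Type 2 slowly changing dimension: a new surrogate key per version,
# so facts recorded at different times reference the attribute values
# that were current at that time.
applicant_dim = []   # rows of the Applicant dimension
next_key = 1

def add_applicant_version(applicant_id, job, debtinc):
    """Insert a new version of an applicant; return its surrogate key."""
    global next_key
    row = {"ApplicantKey": next_key,      # surrogate primary key
           "ApplicantID": applicant_id,   # natural key, stable across versions
           "Job": job, "DebtInc": debtinc}
    applicant_dim.append(row)
    next_key += 1
    return row["ApplicantKey"]

k1 = add_applicant_version("A100", "Office", 35.0)  # first application
k2 = add_applicant_version("A100", "Mgr", 28.0)     # reapplies, job changed
# Both versions are kept; each fact row points at the version valid then.
```

The surrogate key, not the natural ApplicantID, becomes the primary key of the dimension.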
The HMEQ Loan
Data Warehouse Design
STAR 1 – Loan Application facts
• Fact Table: HMEQApplicationFact
• Dimensions: Applicant, Property, Officer, Time
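Under that design, each loan application becomes one fact row carrying foreign keys into the four dimensions plus the measures. A minimal pure-Python sketch (key values and measures are illustrative):

```python
# STAR 1: one fact row per loan application, with surrogate keys into the
# Applicant, Property, Officer, and Time dimensions, plus measures.
time_dim = {20060315: {"TimeKey": 20060315, "Year": 2006, "Month": 3}}

def make_application_fact(applicant_key, property_key, officer_key,
                          time_key, loan_amount, approved):
    """Build one row of the HMEQApplicationFact table."""
    return {"ApplicantKey": applicant_key,   # FK -> Applicant dimension
            "PropertyKey": property_key,     # FK -> Property dimension
            "OfficerKey": officer_key,       # FK -> Officer dimension
            "TimeKey": time_key,             # FK -> Time dimension
            "Loan": loan_amount,             # measure
            "Approved": int(approved)}       # measure (0/1)

fact = make_application_fact(1, 7, 3, 20060315, 25000.0, True)
```

Queries then aggregate the measures while slicing by any combination of the dimension attributes.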
Logistic Regression
Modeling Techniques:
Separate Sampling
Benefits:
• Helps detect rare target levels
• Speeds processing
Risks:
• Biases predictions (correctable)
• Increases prediction variability
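The bias from separate (over)sampling is correctable: if bad cases were oversampled for training, the model's predicted probabilities can be rescaled back to the population prior. A sketch of the standard prior-correction on the odds scale (the rates used below are illustrative):

```python
def adjust_probability(p_model, pop_rate, sample_rate):
    """Correct a predicted P(BAD) from a separately sampled training set
    back to the population prior, by rescaling the odds."""
    odds = p_model / (1 - p_model)
    # ratio of population odds of the rare class to its odds in the sample
    correction = (pop_rate / (1 - pop_rate)) / (sample_rate / (1 - sample_rate))
    adj_odds = odds * correction
    return adj_odds / (1 + adj_odds)

# Model trained on a balanced 50/50 sample; population BAD rate is 20%,
# as in HMEQ. A raw prediction of 0.5 maps back to 0.2.
p = adjust_probability(0.5, pop_rate=0.20, sample_rate=0.5)
print(round(p, 3))  # 0.2
```

When the sample rate equals the population rate, the correction factor is 1 and predictions are unchanged.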
Logistic Regression Models

logit(p) = log(odds) = log( p / (1 − p) ) = w0 + w1x1 + … + wpxp

(Plot: the inverse link p = g⁻¹(logit) is an S-shaped curve rising from 0.0 through 0.5 to 1.0, fitted to the training data.)
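Inverting the logit link gives the score for a new case; a minimal sketch with made-up weights (not fitted to any real data):

```python
import math

def score(x, w0, w):
    """P(target = 1) for inputs x under a logistic regression model."""
    logit = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 / (1 + math.exp(-logit))   # inverse of log(p / (1 - p))

# illustrative weights only
p = score([1.5, -0.3], w0=-2.0, w=[0.8, 1.1])
assert 0.0 < p < 1.0   # always a valid probability
```

Whatever the inputs, the inverse link maps the linear predictor into (0, 1).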
Changing the Odds

log( p / (1 − p) ) = w0 + w1x1 + … + wpxp

Increasing x1 by one unit gives

log( p′ / (1 − p′) ) = w0 + w1(x1 + 1) + … + wpxp

so the new odds equal the old odds times exp(w1): the odds ratio for x1.
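The odds-ratio interpretation is easy to verify numerically: the ratio of odds at x1 + 1 versus x1 equals exp(w1) regardless of where x1 starts. A quick check with an illustrative coefficient:

```python
import math

w0, w1 = -1.0, 0.4   # illustrative coefficients, not fitted values

def odds(x1):
    """Odds of the event for a one-input model: exp(logit)."""
    return math.exp(w0 + w1 * x1)

# the odds ratio is the same at any starting value of x1
ratio = odds(3.0) / odds(2.0)
print(round(ratio, 4) == round(math.exp(w1), 4))  # True
```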
Modeling Tools
Decision Trees
Divide and Conquer the
HMEQ data
Root node: n = 5,000, 10% BAD.
The tree is fitted to the data by recursive partitioning. Partitioning refers to segmenting the data into subgroups that are as homogeneous as possible with respect to the target. In this case, the binary split (Debt-to-Income Ratio < 45) was chosen. The 5,000 cases were split into two groups: the “yes” branch (n = 3,350) with a 5% BAD rate and the “no” branch (n = 1,650) with a 21% BAD rate.
The method is recursive because each subgroup results from splitting a subgroup
from a previous split. Thus, the 3,350 cases in the left child node and the 1,650
cases in the right child node are split again in similar fashion.
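The split search at each node can be sketched in a few lines: try each candidate threshold on a numeric input and keep the split whose children are most homogeneous, here measured by weighted Gini impurity. A toy illustration of the idea, not Enterprise Miner's actual search:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 target values."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(x, y):
    """Best binary split 'x < t', minimizing weighted child impurity."""
    best = (None, float("inf"))
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if impurity < best[1]:
            best = (t, impurity)
    return best

# toy data: debt-to-income ratio vs. BAD flag (made up for illustration)
dti = [20, 25, 30, 40, 46, 50, 55, 60]
bad = [0, 0, 0, 0, 1, 1, 1, 0]
t, impurity = best_split(dti, bad)
print(t)  # 46 -- separates the mostly-good from the mostly-bad cases
```

Recursion then applies the same search to each child node until a stopping rule fires.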
The Cultivation of Trees
– Split Search
•Which splits are to be considered?
– Splitting Criterion
•Which split is best?
– Stopping Rule
•When should the splitting stop?
– Pruning Rule
•Should some branches be lopped off?
Possible Splits to Consider: an enormous number
For an input with L levels, an ordinal input admits L − 1 binary splits, while a nominal input admits 2^(L−1) − 1. The count for nominal inputs grows exponentially: at 20 levels there are over 500,000 candidate splits.
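Those counts follow from simple combinatorics, and are easy to check:

```python
def ordinal_splits(levels):
    # an ordinal input can only be cut between adjacent levels
    return levels - 1

def nominal_splits(levels):
    # a nominal input: any nonempty subset vs. its complement,
    # halved because {A}|{B,C} is the same split as {B,C}|{A}
    return 2 ** (levels - 1) - 1

print(ordinal_splits(20))   # 19
print(nominal_splits(20))   # 524287 -- the "enormous number"
```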
Splitting Criteria
– Interpretability
• tree-structured presentation
– Mixed Measurement Scales
• nominal, ordinal, interval
– Robustness (tolerance to noise)
– Handling of Missing Values
– Regression trees, Consolidation
trees
Modeling Tools
Neural Networks
Neural network models
(multi-layer perceptrons)
Often regarded as a mysterious and powerful predictive
modeling technique.
The most typical form of the model is, in fact, a natural
extension of a regression model:
• A generalized linear model on a set of derived inputs
• These derived inputs are themselves a generalized linear model
on the original inputs
The usual link for the derived input’s model is inverse
hyperbolic tangent, a shift and rescaling of the logit
function
Ability to approximate virtually any continuous
association between the inputs and the target
• You simply need to specify the correct number of derived inputs
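That “regression on derived inputs” view can be written out directly: each hidden unit Hj applies a tanh to a linear combination of the original inputs, and the output is a logistic regression on the Hj. A sketch with arbitrary, untrained weights (all values illustrative):

```python
import math

def hidden_unit(x, bias, weights):
    """Derived input: tanh of a linear combination of the inputs."""
    return math.tanh(bias + sum(w * xi for w, xi in zip(weights, x)))

def mlp_probability(x, hidden_params, w_out0, w_out):
    """Logistic regression on the derived inputs H1..Hk."""
    H = [hidden_unit(x, b, w) for b, w in hidden_params]
    logit = w_out0 + sum(wo * h for wo, h in zip(w_out, H))
    return 1 / (1 + math.exp(-logit))

# three hidden units on two inputs; weights chosen arbitrarily
params = [(0.1, [0.5, -0.2]), (-0.3, [0.1, 0.4]), (0.2, [-0.6, 0.3])]
p = mlp_probability([1.0, 2.0], params, w_out0=0.0, w_out=[1.0, -0.5, 0.8])
assert 0.0 < p < 1.0
```

Training amounts to choosing all the weights so the predicted probabilities match the training targets as closely as possible.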
Neural Network Model

log( p / (1 − p) ) = w00 + w01H1 + w02H2 + w03H3

(Plot: fitted classification surface over inputs x1 and x2 on the training data.)
Input layer, hidden layer, output layer

log( p / (1 − p) ) = w00 + w01H1 + w02H2 + w03H3

The weights are estimated by minimizing an error function summed over the training cases, where:
– N is the number of training cases.
– yi is the target value of the ith case.
– ŷi is the predicted target value.
– ŵ is the current estimate of the model parameters.
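With yi and ŷi defined as above, the fit criterion sums a per-case error over the N training cases. A sketch using squared error, one common choice (the slide does not pin down which error function is used):

```python
def sse(y, y_hat):
    """Sum of squared errors (y_i - y_hat_i)^2 over the N training cases."""
    assert len(y) == len(y_hat)   # one prediction per case
    return sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

y = [0, 1, 1, 0]               # target values y_i
y_hat = [0.1, 0.8, 0.6, 0.3]   # predictions at the current estimate of w
print(round(sse(y, y_hat), 2))  # 0.3
```

Training iteratively adjusts ŵ to drive this total error down on the training data.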
Overgeneralization

log( p / (1 − p) ) = w00 + w01H1 + w02H2 + w03H3

(Plot: an overly complex fit to the training data.)
Final Model

log( p / (1 − p) ) = w00 + w01H1 + w02H2 + w03H3

(Plot: profit used to assess and select the final model on the training data.)