COIS 448: Data Mining & Business Intelligence

Overview of Data Mining Methods

Information Systems Department


Faculty of Computing and Information Technology Rabigh
King Abdulaziz University
Data mining applications

• Automobile insurance company: fraud detection
• Business applications: loan evaluation, customer segmentation, employee evaluation, …
• Data mining tools are categorized by task: classification, estimation, prediction, clustering, and summarization.
• Classification, estimation, and prediction are predictive, while clustering and summarization are descriptive.
History
Statistics
AI: genetic algorithms, neural networks (analogies with biology)
Memory-based reasoning
Link analysis (from graph theory)
Data mining perspectives

• Methods can be viewed from different perspectives. Data mining methods include:
  • Market basket analysis
  • Classification analysis
  • Clustering analysis
  • Regression of various forms
  • AI:
    • Artificial neural networks (ANN)
    • Rule induction (decision trees)
    • Genetic algorithms (supplement)
Techniques
Statistical
• Market-Basket Analysis - find groups of items
• Memory-Based Reasoning - case-based
• Cluster Detection - undirected (quantitative)
Artificial Intelligence
• Link Analysis - e.g., MCI’s Friends & Family
• Decision Trees, Rule Induction - production rules
• Neural Networks - automatic pattern detection
• Genetic Algorithms - keep best parameters
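Market-basket analysis is described above as finding groups of items. A minimal sketch of the two usual measures, support and confidence, assuming a small made-up set of transactions (the item names and the `support` and `confidence` helpers are illustrative and not part of the course material):

```python
# Hypothetical transactions; each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) from basket counts."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

print(support({"bread", "milk"}, transactions))         # 0.5
print(confidence({"butter"}, {"bread"}, transactions))  # 1.0
```

Here every basket with butter also contains bread, so the rule "butter implies bread" has confidence 1.0 even though the pair appears in only half of the baskets.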
Models
Regression: Y = a + bX
Classification: assign a new record to a class
Predictive: assign a value to a new record
Clustering: find groups in the data
Time-series: assign a future value
Links: find patterns in the data
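The regression model Y = a + bX can be fitted by ordinary least squares. A minimal sketch in plain Python, assuming made-up data points (the `fit_line` helper is illustrative):

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + bX; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

# Made-up points that lie exactly on Y = 1 + 2X.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
a, b = fit_line(xs, ys)
print(a, b)        # 1.0 2.0
print(a + b * 4)   # value assigned to a new record with X = 4: 9.0
```

The last line is the predictive use of the model: once a and b are known, a new record receives a value from its X alone.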
Fitting
Underfitting: not enough detail
• leaves out important variables
Overfitting: too much detail
• memorizes the training set, but doesn’t help with new data
• data set too small
• redundancy in data
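The contrast between underfitting and overfitting can be sketched by comparing error on the training set with error on new data. The example below assumes made-up data and three illustrative models: one that ignores the input variable (underfit), a least-squares line, and a lookup that memorizes the training set (standing in for an overfit model).

```python
def mse(model, data):
    """Mean squared error of `model` (a function of x) on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Made-up data: the underlying pattern is y = 2x; training values carry noise.
train = [(1, 2.5), (2, 3.5), (3, 6.5), (4, 7.5)]
new = [(1.4, 2.8), (2.6, 5.2), (3.4, 6.8)]

# Underfit: leaves out x entirely and always predicts the training mean.
mean_y = sum(y for _, y in train) / len(train)
def underfit(x):
    return mean_y

# Reasonable fit: least-squares line through the training data.
mean_x = sum(x for x, _ in train) / len(train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x
def line(x):
    return intercept + slope * x

# Overfit: memorizes the training set and returns the y of the closest stored x.
def memorizer(x):
    return min(train, key=lambda point: abs(point[0] - x))[1]

for name, model in [("underfit", underfit), ("line", line), ("memorizer", memorizer)]:
    print(f"{name:10s} train MSE = {mse(model, train):.3f}  "
          f"new-data MSE = {mse(model, new):.3f}")
```

The memorizer reproduces the training set with zero error but does worse than the line on the new records, while the underfit model has the largest training error of the three.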
Comparison of Features
Feature          Rules      Neural Net  Case-Based  Genetic
Noisy data       Good       Very good   Good        Very good
Missing data     Good       Good        Very good   Good
Large sets       Very good  Poor        Good        Good
Different types  Good       Numerical   Very good   Transform
Accuracy         High       Very high   High        High
Explanation      Very good  Poor        Very good   Good
Integration      Good       Good        Good        Very good
Ease             Easy       Difficult   Easy        Difficult


Data Mining Functions
Classification
• Identify categories in data
Prediction
• Formula to predict future observations
Association
• Rules using relationships among entities
Detection
• Anomalies (unusual) & irregularities (fraud detection)
Financial Applications
Technique    Application             Problem Type
Neural net   Forecast stock price    Prediction
NN, Rule     Forecast bankruptcy     Prediction
NN, Rule     Fraud detection         Detection
NN, Case     Forecast interest rate  Prediction
NN, visual   Late loan detection     Detection
Rule         Credit assessment       Prediction
Rule         Risk classification     Classification
Rule, Case   Corporate bond rate     Prediction
Telecom Applications

Technique                   Application                Problem Type
Neural net, Rule induction  Forecast network behavior  Prediction
Rule induction              Churn                      Classification
Rule induction              Fraud detection            Detection
Case based                  Call tracking              Classification
Marketing Applications
Technique                        Application            Problem Type
Rule induction                   Market segment         Classification
Rule induction                   Cross-selling          Association
Rule induction, visual           Lifestyle analysis     Classification
Rule induction, visual           Performance analysis   Association
Rule induction, genetic, visual  Reaction to promotion  Prediction
Case based                       Online sales support   Classification
Web Applications
Technique                      Application                        Problem Type
Rule induction, Visualization  User browsing similarity analysis  Classification, Association
Rule-based heuristics          Web page content similarity        Association
Other Applications
Technique                   Application            Problem Type
Neural net                  Software cost          Detection
Neural net, rule induction  Litigation assessment  Prediction
Rule induction              Insurance fraud        Detection
Rule induction              Healthcare except.     Detection
Case based                  Insurance claim        Prediction
Case based                  Software quality       Classification
Genetic algorithm           Budget spending        Classification


Data Sets
Loan Applications
• classification
Job Applications
• classification
Insurance Fraud
• detection
Expenditure Data
• prediction
Loan Data
650 observations
OUTCOMES (binary):
• On-time: cost of error $300
• Late (default): cost of error $2,000
Variables
• Age, Income, Assets, Debts, Want, Credit
Credit is ordinal
• Transform: Assets, Debts, & Want → Risk
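Because the two kinds of error on the loan data carry different costs, a model can be judged by total cost of error rather than by accuracy alone. A minimal sketch, assuming that $300 is the cost of classifying an on-time loan as late and $2,000 the cost of classifying a late loan as on-time (the slide gives only the amounts), with made-up predictions from two hypothetical models:

```python
# Costs from the slide; which mistake each one attaches to is an assumption.
COST_ON_TIME_ERROR = 300    # actual on-time loan predicted as late
COST_LATE_ERROR = 2000      # actual late loan predicted as on-time

def total_error_cost(actual, predicted):
    """Sum the cost of every misclassified loan."""
    cost = 0
    for a, p in zip(actual, predicted):
        if a == "on-time" and p == "late":
            cost += COST_ON_TIME_ERROR
        elif a == "late" and p == "on-time":
            cost += COST_LATE_ERROR
    return cost

# Hypothetical outcomes for five loans and two models' predictions.
actual  = ["on-time", "on-time", "late", "on-time", "late"]
model_a = ["on-time", "late",    "late", "late",    "late"]     # two cheap errors
model_b = ["on-time", "on-time", "late", "on-time", "on-time"]  # one expensive error

print(total_error_cost(actual, model_a))  # 600
print(total_error_cost(actual, model_b))  # 2000
```

Model B makes fewer mistakes than model A, yet its single missed default costs more than both of model A's errors combined.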
Job Application Data
500 observations
OUTCOMES (ordinal):
• Unacceptable
• Minimal
• Acceptable
• Excellent
Variables
• Age, State, Degree, Major, Experience
State is nominal; Degree & Major are ordinal
State is superfluous
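Preparing a record from this data set might involve dropping the superfluous State variable and mapping ordinal fields to ranks. A rough sketch, assuming hypothetical degree levels and a made-up applicant (the slides do not give the actual coding; only the outcome classes and variable names come from the slide):

```python
# Outcome classes from the slide, in increasing order of quality.
OUTCOME_RANK = {"Unacceptable": 0, "Minimal": 1, "Acceptable": 2, "Excellent": 3}

# Hypothetical degree levels; the actual coding is not given in the slides.
DEGREE_RANK = {"None": 0, "Bachelor": 1, "Master": 2, "PhD": 3}

def prepare(record):
    """Drop the superfluous State field and map ordinal fields to ranks."""
    cleaned = {k: v for k, v in record.items() if k != "State"}
    cleaned["Degree"] = DEGREE_RANK[cleaned["Degree"]]
    cleaned["Outcome"] = OUTCOME_RANK[cleaned["Outcome"]]
    return cleaned  # Major is left as-is because its ordering is unknown here

# Made-up applicant record using the variable names listed above.
applicant = {"Age": 28, "State": "CA", "Degree": "Master", "Major": "IS",
             "Experience": 4, "Outcome": "Acceptable"}
print(prepare(applicant))
```

Encoding ordinal values as ranks keeps their order available to the model, whereas a nominal variable such as State has no order to preserve.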
Insurance Claim Data
5,000 observations
OUTCOMES (binary):
• OK: cost of error $500
• Fraudulent: cost of error $2,500
Variables
• Age, Gender, Claim, Tickets, Prior claims, Attorney
Gender & Attorney are nominal; Tickets & Prior claims are categorical
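Unequal error costs also shift the point at which a claim should be flagged. A minimal sketch of a cost-based decision rule, assuming that $500 is the cost of flagging a legitimate claim and $2,500 the cost of passing a fraudulent one, and that some model (hypothetical here) supplies a fraud probability `p_fraud` for each claim:

```python
COST_OK_ERROR = 500       # assumed: legitimate claim flagged as fraudulent
COST_FRAUD_ERROR = 2500   # assumed: fraudulent claim passed as OK

def decide(p_fraud):
    """Pick the label with the lower expected cost of error."""
    expected_cost_if_ok = p_fraud * COST_FRAUD_ERROR        # wrong if the claim is fraud
    expected_cost_if_fraud = (1 - p_fraud) * COST_OK_ERROR  # wrong if the claim is OK
    return "Fraudulent" if expected_cost_if_fraud < expected_cost_if_ok else "OK"

# Break-even probability: above this, flagging is cheaper in expectation.
threshold = COST_OK_ERROR / (COST_OK_ERROR + COST_FRAUD_ERROR)
print(round(threshold, 3))  # 0.167
print(decide(0.10))         # OK
print(decide(0.25))         # Fraudulent
```

Under these assumed costs a claim is worth flagging once its estimated fraud probability exceeds about one in six, well below the 0.5 cutoff that equal costs would imply.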
Expenditure Data
10,000 observations
OUTCOMES:
• Could predict response in a number of categories
• Others
Variables:
• Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Driver's license, Own home, Number of credit cards
• Churn, proportion of income spent on seven categories
