
ANL203

Analytics for Decision Making


Seminar 2
ECA – Generic Dataset
Please note that the Generic Dataset provided for the ECA serves as a guideline.

You can modify the dataset to suit your needs. For example, rename some of the variables, derive new
variables from the dataset and/or create your own variables.

If you have a better dataset and would like to use it to build your dashboard, please do so. You will need to state the
source of your dataset in your report.

IMPORTANT

Provide screenshots of the dashboard in your report, and save the dashboard as a single Tableau packaged
workbook file with extracted data, named “student_number.twbx”. Put both items (i.e., the Tableau packaged
workbook and the dataset used to construct the dashboard) into a folder, zip it and submit the zipped folder to ECA_PPT.

2
Recap – Seminar 1
o Data Visualisation
o Benefits
o Stages – Four basic stages and three loops
o Data – Attributes and Measurement (Nominal, Ordinal, Ratio and Interval)
o Various types of data – Categorical, Time Series, Spatial, Multi Variable and Distribution

o Business Performance Measures


o 7 Principles
o Strategic Business Performance Management Framework
o Vision and Mission
o Strategic Themes – main high level business strategies; major areas that the organization will focus on
o Strategic Objectives – defined objectives; 4 aspects need to be considered – impact, measurability, frequency, and
availability and accessibility
o Strategy Map – a diagram that is used to communicate the strategic objectives
o Business Performance Targets – SMART
o Strategic Initiatives – specific projects, programmes or planned activities to help to meet or exceed the targets

3
Recap – Seminar 1
o Dashboards
o 3 types – Strategic, Tactical and Operational
o Benefits
o Design Principles
o Tableau Hands-On

4
Connecting the dots…

Strategic Objectives          Strategic Initiatives                   Data Mining        Dashboard

Increase Sales by $1M         Increase sales of Product X by $200K    Who to target?     Monitor sales of Product X

Increase Market Share by 5%   Reduce churn by 15%                     Who will churn?    Monitor the number of churners

5
Connecting the dots…

Strategic Objectives          Strategic Initiatives                   Dashboard                         Data Mining

Increase Sales by $1M         Increase sales of Product X by $200K    Monitor sales of Product X        Who to target?

Increase Market Share by 5%   Reduce churn by 15%                     Monitor the number of churners    Who will churn?

6
Topics
• Overview of Data Mining
• Data Mining Methodology
o CRISP-DM
• Data Mining Techniques
o Description and Visualisation
o Association Analysis
o Clustering
o Predictive Modelling
• Pre-Requisites of Successful Data Mining
• Limitations of Data Mining
• Case Study: New York City Fights Fire with Data
• Demo

7
Overview of Data Mining
Overview of Data Mining
How did Data Mining come about?
• Data explosion problem
• There is a tremendous increase in the amount of data recorded and stored on digital media
• We are drowning in data, but starving for knowledge!

“The greatest problem of today is how to teach people to ignore the irrelevant, how to
refuse to know things, before they are suffocated. For too many facts are as bad as
none at all.” (W.H. Auden)
Overview of Data Mining
Definition of Data Mining
• The best way to understand data mining is to view it in a decision-making context. To make
good decisions, decision-makers need good information.
• Data Mining can be defined as the process of finding previously unknown and valid
information from databases and using that information to take certain actions or make good
decisions
• Uses techniques from the disciplines of
o statistics,
o artificial intelligence,
o and machine learning.

10
Overview of Data Mining

1. Restaurant Payment
2. Credit Card
3. Credit Card Swipe
4. Transmission of Card Holder and Transaction Data
5. Card Company Processing of Data
6. DM Model Checking of Fraudulent Card Use
7a. DM Result: Low Probability of Fraud, leading to 8a. Card Payment Approval
OR
7b. DM Result: High Probability of Fraud, leading to 8b. Card Payment Non-Approval
13
Data Mining Methodology
Data Mining Methodology
CRoss-Industry Standard Process for Data Mining

15
Source: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/1200px-CRISP-DM_Process_Diagram.png
Data Mining Methodology
Phase 1 – Business Understanding
This phase comprises four tasks:

• Determine business objectives

• Assess situation

• Determine data mining goals

• Produce project plan

16
Data Mining Methodology
What are the data mining goals?

17
Data Mining Methodology
Phase 2 – Data Understanding
There are three tasks in this phase:

• Collect data

• Describe and explore data

• Verify data quality

18
Data Mining Methodology
Phase 3 – Data Preparation
Data preparation can be divided into the following subtasks:

• Select data

• Clean data

• Construct and integrate data

• Format data

19
Data Mining Methodology
Phase 4 – Modeling
This phase can be divided into the following subtasks:

• Select the modeling technique

• Generate test design

• Build and assess model

20
Data Mining Methodology
Phase 5 – Evaluation
This phase can be divided into the following subtasks:

• Evaluate results

• Review process and determine next steps

21
Data Mining Methodology
Phase 6 – Deployment
This phase can be divided into the following subtasks:

• Plan deployment

• Monitoring and maintenance

• Final report

22
Data Mining Methodology
Overview of the CRISP-DM tasks

23
(Source: https://www.the-modelling-agency.com/crisp-dm.pdf)
Data Mining Methodology – An illustration
Case Study – ABC Bank
The number of credit card frauds at ABC Bank has been increasing over the past few years, and
the bank has suffered substantial losses in the past few months. The bank has
engaged you and your group as business analysts.

24
Data Mining Methodology – An illustration
Case Study – ABC Bank
Phase 1 – Business Understanding
Business Problem:
• Credit card fraud is causing losses to both customers and the bank

Business Goal:
• Detect credit card fraud accurately and prevent it from happening in future

Data Mining Goal:

• Accurately identify or predict credit card fraud based on transaction patterns.

25
Data Mining Methodology – An illustration
Case Study – ABC Bank
Phase 2 – Data Understanding
Collecting data:
Data are assumed to have been collected as past transaction history in the data warehouse. Which attributes should be
used (e.g., time, venue, type)?

Describing and exploring data:

Use the SPSS Modeler ‘Data Audit’ node to describe the data

Verifying data quality:

Use the SPSS Modeler ‘Data Audit’ node to assess the quality

26
Data Mining Methodology – An illustration
Case Study – ABC Bank
Phase 3 – Data Preparation
Selecting data:
The set of attributes relevant to classification includes ‘Time of fraud’, ‘Day of fraud’, etc.

Cleaning data:
Remove outliers, replace missing values, etc.

Constructing and integrating data:


Incorporate external databases or create new calculated fields. For example, we can integrate the
transaction database (which only contains transactions) and customer database (which only
contains customer information)
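
As a rough illustration of constructing and integrating data, the sketch below joins a small transaction table to a customer table and derives one new calculated field. The record layouts, field names and the night-time threshold are hypothetical and are not taken from ABC Bank's data.

```python
# Hypothetical customer database: customer information only.
customers = {
    "C001": {"age": 34, "member": True},
    "C002": {"age": 52, "member": False},
}

# Hypothetical transaction database: transactions only.
transactions = [
    {"txn_id": 1, "customer_id": "C001", "amount": 120.0, "hour": 14},
    {"txn_id": 2, "customer_id": "C002", "amount": 980.0, "hour": 3},
]

integrated = []
for txn in transactions:
    # Integrate: attach the matching customer record to each transaction.
    row = {**txn, **customers[txn["customer_id"]]}
    # Construct: derive a new field (the 6 a.m. cut-off is an assumption).
    row["is_night"] = row["hour"] < 6
    integrated.append(row)

for row in integrated:
    print(row)
```

Each integrated row now carries both transaction and customer attributes, which is the form a classification model would need as input.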

27
Data Mining Methodology – An illustration
Case Study – ABC Bank
Phase 4 – Modelling
Selecting the modelling technique:
Classification – decision trees, neural networks.

Generating test design:

Use 70% of the past fraud transaction data as training data and the remaining 30% as testing data.

Building and assessing the model:

Build the decision tree, then evaluate its performance using the SPSS Modeler Analysis node by comparing the predicted and
actual target values for both the training and testing data.

28
Data Mining Methodology – An illustration
Case Study – ABC Bank
Phase 5 – Evaluation
Evaluating results:
Identify the best tool for data mining classification. Check the % of testing data that are correctly
classified.

Reviewing the process and determining the next step:


If there is a need to improve the model, we have to start from the initial phase of business
understanding.

Phase 6 – Deployment
Deployment:
If the model is ready we can proceed with the deployment.

Notes:
The model is based on past transactional data of known frauds; hence, when new types of fraud emerge, the
model should be updated.

29
Data Mining Techniques
Data Mining Techniques
Description and Visualisation
• Description and visualisation can contribute towards the understanding and detection of hidden
patterns in the data.

• Description refers to the summarisation of data to facilitate understanding.

• Visualisation can be considered an enhanced graphical approach that allows user input.

31
Data Mining Techniques
Association Analysis
• Association analysis is a descriptive technique that finds groupings by looking for patterns or
clusters among a set of items.

• The objective of association analysis is to determine which items co-occur frequently within a
database.

• Association analysis produces rules that are intuitive and easy to understand.

• Possible applications:
o Market Basket Analysis: What are the items inside each market basket?
o Fraud Detection: Association of certain patterns or incidences in transactions may flag a
potential fraud.

32
Data Mining Techniques
Association Analysis
• An itemset is a collection of one or more items.

• Association rules are statements in the form of:


If antecedent(s) then consequent(s): antecedent(s) → consequent(s)

• Association implies co-occurrence and not causality.

• Support is defined as the frequency of occurrence of an itemset.
Support = % of times X appears = P(X)

• Rule support is the fraction of total transactions that contain both X and Y.
Rule support = % of times X and Y appear together = P(X and Y)

• Confidence measures how often the items in the consequent set Y appear in transactions
containing the antecedent set X.
Confidence = likelihood that Y appears when X occurs = P(X and Y)/P(X)

33
Data Mining Techniques
Exercise
ID Item

1 Bread, Jam

2 Bread, Chips, Biscuits, Ice-cream

3 Jam, Chips, Biscuits, Muffins

4 Bread, Jam, Chips, Biscuits, Muffins

5 Bread, Jam, Chips, Muffins

1. What is the rule support for {Jam, Bread} ?

2. What is the confidence for {Jam, Chips} → {Biscuits}?

34
Data Mining Techniques
Solution
ID Item

1 Bread, Jam

2 Bread, Chips, Biscuits, Ice-cream

3 Jam, Chips, Biscuits, Muffins

4 Bread, Jam, Chips, Biscuits, Muffins

5 Bread, Jam, Chips, Muffins

1. What is the rule support for {Jam, Bread} ? 60%

2. What is the confidence for {Jam, Chips} → {Biscuits}? 67%

X = Jam and Chips; Y = Biscuits
Support count for (X, Y) = {Jam, Chips, Biscuits} is 2 (IDs 3 and 4).
Support count for X is 3 (IDs 3, 4 and 5).
Confidence = 2/3 ≈ 67%; that is, 67% of the transactions that contain Jam and Chips also contain Biscuits.
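
The two answers can be checked with a short script. This is a minimal sketch that applies the support and confidence definitions to the five transactions in the exercise; the function names are illustrative only.

```python
# Transactions from the exercise, keyed by transaction ID.
transactions = {
    1: {"Bread", "Jam"},
    2: {"Bread", "Chips", "Biscuits", "Ice-cream"},
    3: {"Jam", "Chips", "Biscuits", "Muffins"},
    4: {"Bread", "Jam", "Chips", "Biscuits", "Muffins"},
    5: {"Bread", "Jam", "Chips", "Muffins"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """P(X and Y) / P(X) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"Jam", "Bread"})
rule_confidence = confidence({"Jam", "Chips"}, {"Biscuits"})

print(f"Rule support for {{Jam, Bread}}: {rule_support:.0%}")
print(f"Confidence for {{Jam, Chips}} -> {{Biscuits}}: {rule_confidence:.0%}")
```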
Data Mining Techniques
Clustering
• Clustering is an exploratory technique that attempts to discover natural groupings in data.

• The objective is to group similar (homogeneous) objects into the same cluster and dissimilar
(heterogeneous) objects into different clusters on the basis of distances among these objects.

• Domain knowledge plays an important role in deciding on the most useful results.

• Possible applications:

o Customer Segmentation: Customer population is partitioned into distinct groups based on
purchasing patterns and other customer characteristics

o Fraud Detection: Identifying subgroups/clusters that might behave differently or abnormally

36
Data Mining Techniques
Clustering
• Cluster sizes should not be too small (i.e. containing only a very small number of observations)
so that the clusters are meaningful to use. The rule-of-thumb for the minimum number of
observations in a cluster is about 5 to 10 percent of the dataset size.

• A clustering solution should not consist of too many clusters because deployment of the
results will then be less feasible or useful.

• The number of clustering criteria should not be excessive, so that the clustering solution can
be meaningfully interpreted and used.

38
Data Mining Techniques
Clustering
In K-means clustering, “K” refers to the number of clusters and “means” refers to the cluster centroids
(i.e., the centre or average of all the observations within a cluster).
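
To make the roles of “K” and the centroids concrete, here is a minimal sketch of the usual K-means loop in plain Python: assign each observation to its nearest centroid, then move each centroid to the mean of its members. The customer data, the fixed number of iterations and the random starting centroids are assumptions made for illustration; this is not the algorithm as implemented in any particular tool.

```python
import random

def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, iterations=20, seed=0):
    """Minimal K-means for 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct observations
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# Hypothetical customers: (monthly spend in $'00, purchases per quarter)
customers = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, labels = kmeans(customers, k=2)
print(centroids)
print(labels)
```

With K = 2, the three low-spend customers end up in one cluster and the three high-spend customers in the other; which cluster is numbered 0 depends on the starting centroids.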

39
Data Mining Techniques
Predictive Modelling
• Classification refers to the prediction of a target that is categorical in nature. Examples of classification
include predicting fraud versus non-fraud, high-risk versus low-risk or purchase versus non-purchase.

• Estimation, on the other hand, refers to the prediction of a target that is quantitative (i.e., continuous) in
nature (e.g., predicting the amount spent, duration of a telephone call or account balance).

• Predictive modelling attempts to predict a target (also called a response variable or a dependent variable) on
the basis of one or more inputs (also called predictor variables or independent variables).

40
Data Mining Techniques
Predictive Modelling
• Three data mining tools are commonly used for predictive modelling, namely, regression,
neural networks and decision trees.

• There is no one best data mining tool for predictive modelling as each of these models has its
own pros and cons.

[Figure panels: Logistic Regression, Neural Network, Decision Tree]


Data Mining Techniques
Predictive Modelling
Data mining is used in many industries and has been proven to aid decision making.
• Telecommunication: One of the most common uses of predictive modelling in the telecommunication
industry is to predict whether a customer is likely to churn.

• Banking: Credit scoring, an application of predictive modelling, is widely used in banks to aid
decisions on whether to offer credit to their customers.

• Retail: Retail shops use predictive modelling to aid their up-selling and cross-selling efforts.
They send mailers to customers who are likely to respond to up-selling or cross-selling campaigns
so as to achieve higher profit with lower marketing cost.

42
Data Mining Techniques
Predictive Modelling Methodology

43
Source: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/1200px-CRISP-DM_Process_Diagram.png
Data Mining Techniques
Predictive Modelling: Decision Trees
• A decision tree comprises a series of decision points, where the entire set of observations or subset of these
observations is split on the basis of some splitting criteria.

• Subsets are similar in characteristics

• Classification Tree
Able to predict the values of a categorical target variable, given a set of inputs

• Regression Tree
Able to estimate the values of a continuous target variable, given the input

• Recursive partitioning algorithm

44
Data Mining Techniques
Predictive Modelling: Decision Trees
What are the rules for the buyers?

45
Data Mining Techniques
Predictive Modelling: Decision Trees
What are the rules for the buyers?

Rules for Buyers:


Node 4: Income < $100,000; Age >= 25 (75%)

Node 8: Income >= $100,000; Male; Race is Malay or Indian (85%)

46
Data Mining Techniques
Predictive Modelling: Evaluation

• A “good” predictive model should have predictive performance that is
considered effective for practical application.

• In general, there are two aspects to consider when evaluating model
performance:
o the performance measure, and
o the model evaluation method.

47
Data Mining Techniques
Predictive Modelling: Evaluation
Performance Measure

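Confusion matrix cells (m = p + q + u + v observations):

                   Predicted negative   Predicted positive
Actual negative            p                    q
Actual positive            u                    v

overall accuracy = (p + v) / m = (p + v) / (p + q + u + v)
error rate = 1 − accuracy = 1 − (p + v) / m, or (q + u) / m

accuracy for positive cases = v / (u + v)
hit rate for positive cases = v / (q + v)

accuracy for negative cases = p / (p + q)
hit rate for negative cases = p / (p + u)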
48
Data Mining Techniques
Predictive Modelling: Evaluation
Simple Train-and-Test Evaluation Method
• Partition the data into training data and testing data. Construct the model
based on the training data.

• Test the model using the testing data. The testing data serves as a data set that
the model has not seen before.

• The Simple Train-and-Test approach allows analysts to infer how well the model
would perform on unseen data; that is, observations that have not been seen by
the model.
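
The partitioning step can be sketched in a few lines of Python. The 70/30 ratio follows the ABC Bank illustration earlier in this seminar; the function name, the random seed and the ten observation IDs are hypothetical.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of `records` and split it into (training, testing)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # random order avoids biased partitions
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(1, 11))  # ten hypothetical observation IDs
training, testing = train_test_split(records)

print("Training:", training)
print("Testing: ", testing)
```

The model would be built on `training` only, and the performance measures would then be computed on `testing`, which the model has not seen.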

49
Data Mining Techniques
Predictive Modelling: Evaluation
Performance Measure
              Predicted Response
Response      Yes     No     Total
Yes            65     35       100
No             30     70       100
Total          95    105       200

Based on the confusion matrix, what is the overall accuracy rate, the overall error rate
and the hit rate for “Yes”?
50
Data Mining Techniques
Predictive Modelling: Evaluation
Performance Measure
              Predicted Response
Response      Yes     No     Total
Yes            65     35       100
No             30     70       100
Total          95    105       200

• Overall accuracy rate of the predictive model is 67.5% (i.e., (65 + 70)/200)
• Overall error rate is 32.5% (i.e., 1 – (65 + 70)/200)
• Hit rate for “Yes” is 68.4% (i.e., 65/95).
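
The same three figures can be reproduced from the four cell counts. This is a minimal sketch; the variable names are illustrative, and the hit rate follows the definition used in the solution above (correct “Yes” predictions divided by all “Yes” predictions).

```python
# Cell counts from the confusion matrix: rows are actual, columns are predicted.
yes_yes, yes_no = 65, 35   # actual "Yes": predicted Yes, predicted No
no_yes, no_no = 30, 70     # actual "No":  predicted Yes, predicted No
total = yes_yes + yes_no + no_yes + no_no

accuracy = (yes_yes + no_no) / total          # correct predictions / all cases
error_rate = 1 - accuracy
hit_rate_yes = yes_yes / (yes_yes + no_yes)   # correct among predicted "Yes"

print(f"Overall accuracy: {accuracy:.1%}")
print(f"Overall error rate: {error_rate:.1%}")
print(f"Hit rate for Yes: {hit_rate_yes:.1%}")
```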

51
Pre-Requisites and Limitations
Pre-Requisites and Limitations
The state of the data, the coordination among different departments and
the complexity of a data mining project are important determinants of the
duration of the data mining project.

The data mining team should possess the following knowledge and skills:
domain knowledge, data mining knowledge and skills, IT expertise, and
statistical and research expertise.

53
Pre-Requisites and Limitations
• Some patterns, trends and relationships found in data mining may not be useful.

• Data mining is well developed for modelling (i.e., prediction) but it is not as well
developed for effect assessment.

• Data mining requires a substantial investment of resources.

• Data mining requires intensive planning and technical preparation work.

• Collective knowledge and skills are needed; hence different departments in an
organisation have to work closely together

54
New York City Fights Fire with Data
New York City Fights Fire with Data
Based on the article, what do you think is

• Business issue

• Data mining objective

• Data mining technique

• Data mining application

• How the application had helped to address the business issue

• Limitations of the application

56
http://www.govtech.com/public-safety/New-York-City-Fights-Fire-with-Data.html
New York City Fights Fire with Data
Based on the article, what do you think is

1. Business issue
• Rising volume of building inspection requirements.

2. Data mining objective
• Predict where fires might spark

3. Data mining technique
• Each data element is given a weight to calculate fire risk, similar to credit scoring; the likely
technique used is regression

4. Data mining application
• Risk score engine; buildings with higher risk scores were inspected first.

57
http://www.govtech.com/public-safety/New-York-City-Fights-Fire-with-Data.html
New York City Fights Fire with Data
5. How the application had helped to address the business issue?
• Helped to better deploy the limited resources to inspect buildings with higher fire risk scores
before a fire sparks.

6. Limitations of the application
• Manual inputs by inspectors are subject to data quality issues
• Only applicable to New York City
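
The idea of weighting each data element to obtain a risk score, then inspecting the highest-scoring buildings first, can be sketched as follows. The attributes, weights and buildings are entirely hypothetical and do not reproduce the model described in the article; in practice the weights would be estimated from data (for example, by regression) rather than set by hand.

```python
# Hypothetical weights per data element, for illustration only.
weights = {"building_age_years": 0.5, "past_violations": 10.0, "vacant": 25.0}

# Hypothetical buildings and their attribute values.
buildings = {
    "A": {"building_age_years": 80, "past_violations": 3, "vacant": 1},
    "B": {"building_age_years": 20, "past_violations": 0, "vacant": 0},
    "C": {"building_age_years": 50, "past_violations": 5, "vacant": 0},
}

def risk_score(attributes):
    """Weighted sum of a building's attributes."""
    return sum(weights[name] * value for name, value in attributes.items())

# Inspect buildings in descending order of risk score.
inspection_order = sorted(
    buildings, key=lambda name: risk_score(buildings[name]), reverse=True
)

for name in inspection_order:
    print(name, risk_score(buildings[name]))
```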

58
http://www.govtech.com/public-safety/New-York-City-Fights-Fire-with-Data.html
Data Mining Demo – SPSS Modeler
Demo
• MailPurchase is a mail order company with a database of 1,400 customers.
• For each customer, the following data for last year were captured.

Attribute   Description
Status      Whether the customer has purchased a promoted product in any of the quarterly marketing campaigns
Expend      Average monthly expenditure on the company’s products
Numpur      Average number of purchases per quarter
Age         Age of customer as at 1 January
Gender      Gender of customer
Income      Annual income of customer as at 1 January (in $’000)
Race        Race of customer
Marital     Marital status of customer as at 1 January
Member      Whether the customer is a member of the loyalty card programme

60
Demo
• Suppose that, for the next marketing campaign, MailPurchase wants to target
only existing customers with a high probability of purchase.

• Hence it is useful to classify existing customers as likely purchasers or non-purchasers.

• To construct this prediction model, MailPurchase has decided to use ‘Status’ as the target
variable and the others as inputs.

• From the prediction model, MailPurchase will be able to predict the probability of purchase
and hence be able to classify existing customers into the purchaser and non-purchaser
groups.

61
Demo

62
Demo

CHAID is
chosen as the
final model

63
Demo

64
Demo

Node   First Variable      Second Variable   Third Variable          Fourth Variable   Fifth Variable   Probability
20     Race = 1, 2 or 4    Member = 0        Gender = 1              Marital = 3                        0.596
24     Race = 1, 2 or 4    Member = 0        Gender = 2              Age <= 25         Income <= 92     0.909
10     Race = 1, 2 or 4    Member = 1        Age <= 29                                                  0.800
11     Race = 1, 2 or 4    Member = 1        Age > 29 and <= 43                                         0.635
13     Race = 1, 2 or 4    Member = 1        Age > 60                                                   0.625
14     Race = 3            Marital = 2       Income <= 61                                               0.765
15     Race = 3            Marital = 2       Income > 61 and <= 186                                     0.970
17     Race = 3            Marital = 3       Gender = 1                                                 0.919
18     Race = 3            Marital = 3       Gender = 2                                                 0.600
65
suss.edu.sg
