You can modify the dataset to suit your needs. For example, rename some of the variables, derive new
variables from the dataset and/or create your own variables.
If you have a better dataset and would like to use it to build your dashboard, please do so. You will need to state the
source of your dataset in your report.
IMPORTANT
Provide screenshots of the dashboard in your report, and save the dashboard as a single Tableau packaged
workbook file with extracted data named “student_number.twbx”. Put both items (i.e., the Tableau packaged
workbook and the dataset used to construct the dashboard) into a folder, zip it and submit the zipped folder to ECA_PPT.
Recap – Seminar 1
o Data Visualisation
o Benefits
o Stages – Four basic stages and three loops
o Data – Attributes and Measurement (Nominal, Ordinal, Ratio and Interval)
o Various types of data – Categorical, Time Series, Spatial, Multivariable and Distribution
o Dashboards
o 3 types – Strategic, Tactical and Operational
o Benefits
o Design Principles
o Tableau Hands-On
Connecting the dots…
Strategic Objectives: Increase sales by $1M
Strategic Initiatives: Increase sales of Product X by $200K
Dashboard: Monitor sales of Product X
Data Mining: Who to target?
Topics
• Overview of Data Mining
• Data Mining Methodology
o CRISP-DM
• Data Mining Techniques
o Description and Visualisation
o Association Analysis
o Clustering
o Predictive Modelling
• Pre-Requisites of Successful Data Mining
• Limitations of Data Mining
• Case Study: New York City Fights Fire with Data
• Demo
Overview of Data Mining
How did Data Mining come about?
• Data explosion problem
• There is a tremendous increase in the amount of data recorded and stored on digital media
• We are drowning in data, but starving for knowledge!
“The greatest problem of today is how to teach people to ignore the irrelevant, how to
refuse to know things, before they are suffocated. For too many facts are as bad as
none at all.” (W.H. Auden)
Definition of Data Mining
• The best way to understand data mining is to view it in a decision-making context. To make
good decisions, decision-makers need good information.
• Data Mining can be defined as the process of finding previously unknown and valid
information from databases and using that information to take certain actions or make good
decisions
• Uses techniques from the disciplines of
o statistics,
o artificial intelligence,
o and machine learning.
Overview of Data Mining
Credit card payment with a data mining (DM) fraud check:
1. Restaurant payment
2. Credit card
3. Credit card swipe
4. Transmission of card holder and transaction data
5. Card company processing of data
6. DM model checking of fraudulent card use
7a. DM result: low probability of fraud, leading to 8a. Card payment approval
OR
7b. DM result: high probability of fraud, leading to 8b. Card payment non-approval
Data Mining Methodology
CRoss-Industry Standard Process for Data Mining
Source: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/1200px-CRISP-DM_Process_Diagram.png
Phase 1 – Business Understanding
This phase comprises four tasks:
• Assess situation
What are the data mining goals?
Phase 2 – Data Understanding
There are three tasks in this phase:
• Collect data
Phase 3 – Data Preparation
Data preparation can be divided into the following subtasks:
• Select data
• Clean data
• Format data
Phase 4 – Modeling
This phase can be divided into the following subtasks:
Phase 5 – Evaluation
This phase can be divided into the following subtasks:
• Evaluate results
Phase 6 – Deployment
This phase can be divided into the following subtasks:
• Plan deployment
• Final report
Overview of the CRISP-DM tasks
(Source: https://www.the-modelling-agency.com/crisp-dm.pdf)
Data Mining Methodology – An illustration
Case Study – ABC Bank
The number of credit card frauds at ABC Bank has been increasing over the past few years, and
the bank has suffered substantial losses in the past few months. The bank has
engaged you and your group as business analysts.
Phase 1 – Business Understanding
Business Problem:
• Credit card fraud is causing losses to customers and also to the bank
Business Goal:
• Detect future credit card fraud accurately and prevent it from happening
Phase 2 – Data Understanding
Collecting data:
Assume the data has already been collected as past transaction history in the data warehouse. Which attributes should be
used? (Time/Venue/Type)
Phase 3 – Data Preparation
Selecting data:
The set of attributes relevant to classification includes ‘Time of fraud’, ‘Day of fraud’, etc.
Cleaning data:
Remove any outliers, replace missing values, etc.
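The cleaning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transaction amounts are made up, the helper name `clean_amounts` is hypothetical, missing values are replaced with the median, and outliers are defined by the common 1.5 × IQR rule. None of these choices comes from the case study itself.

```python
import statistics
from typing import List, Optional


def clean_amounts(amounts: List[Optional[float]]) -> List[float]:
    """Replace missing values with the median, then drop IQR-based outliers."""
    known = [a for a in amounts if a is not None]
    median = statistics.median(known)
    filled = [median if a is None else a for a in amounts]

    # Quartiles of the filled data; values beyond 1.5 * IQR are treated as outliers.
    q1, _, q3 = statistics.quantiles(filled, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [a for a in filled if low <= a <= high]


# Hypothetical transaction amounts; None marks a missing value.
raw = [20.0, 35.0, None, 50.0, 42.0, 5000.0, 28.0]
cleaned = clean_amounts(raw)
print(cleaned)
```

In a fraud setting, extreme values may themselves be the cases of interest, so whether to remove them is a judgment call for the analyst.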
Phase 4 – Modelling
Selecting the modelling technique:
Classification – decision trees, neural networks.
Phase 5 – Evaluation
Evaluating results:
Identify the best tool for classification by checking the percentage of testing data that is correctly
classified.
Phase 6 – Deployment
Deployment:
If the model is ready, we can proceed with the deployment.
Notes:
The model is built on transaction data from past frauds, so it should be updated whenever new
frauds occur.
Data Mining Techniques
Description and Visualisation
• Description and visualisation can contribute towards the understanding and detection of hidden
patterns in the data.
• Visualisation can be considered an enhanced graphical approach that allows user input.
Association Analysis
• Association analysis is a descriptive technique that finds grouping by looking for patterns or
clusters among a set of items.
• The objective of association analysis is to determine which items co-occur frequently within a
database.
• Association analysis produces rules that are intuitive and easy to understand.
• Possible applications:
o Market Basket Analysis: What are the items inside each market basket?
o Fraud Detection: Association of certain patterns or incidences in transactions may flag a
potential fraud.
Association Analysis
• An itemset is a collection of one or more items.
• Support is defined as the frequency of occurrence of an itemset.
Support = % of transactions in which X appears = P(X)
• Rule support is the fraction of total transactions that contain both X and Y.
Rule support = % of transactions in which X and Y appear together = P(X and Y)
• Confidence measures how often the items in the consequent set Y appear in transactions
containing the antecedent X.
Confidence = likelihood that Y appears when X occurs = P(X and Y)/P(X)
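The three measures above can be computed directly from a list of transactions. The sketch below is illustrative only: the baskets are made up (they are not the exercise data), and the helper names `support` and `confidence` are assumptions introduced for this example.

```python
from typing import List, Set

# Hypothetical market baskets, one set of items per transaction.
transactions: List[Set[str]] = [
    {"bread", "jam"},
    {"bread", "milk"},
    {"bread", "jam", "milk"},
    {"milk"},
    {"bread", "jam", "butter"},
]


def support(itemset: Set[str], baskets: List[Set[str]]) -> float:
    """Fraction of baskets containing every item in the itemset, i.e. P(itemset)."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent: Set[str], consequent: Set[str],
               baskets: List[Set[str]]) -> float:
    """P(X and Y) / P(X): how often Y appears in baskets that contain X."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


support_bread = support({"bread"}, transactions)               # P(bread)
rule_support = support({"bread", "jam"}, transactions)         # P(bread and jam)
conf_bread_jam = confidence({"bread"}, {"jam"}, transactions)  # P(jam | bread)

print(f"support={support_bread:.2f} rule support={rule_support:.2f} "
      f"confidence={conf_bread_jam:.2f}")
```

Here bread appears in 4 of 5 baskets and bread with jam in 3 of 5, so the confidence of the rule "bread → jam" is 3/4.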
Exercise
ID Item
1 Bread, Jam
Solution
ID Item
1 Bread, Jam
Clustering
• The objective is to group similar (homogeneous) objects into the same cluster and dissimilar
(heterogeneous) objects into different clusters on the basis of distances among these objects.
• Domain knowledge plays an important role in deciding on the most useful results.
• Possible applications:
Clustering
• Cluster sizes should not be too small (i.e. containing only a very small number of observations)
so that the clusters are meaningful to use. The rule-of-thumb for the minimum number of
observations in a cluster is about 5 to 10 percent of the dataset size.
• A clustering solution should not consist of too many clusters because deployment of the
results will then be less feasible or useful.
• The number of clustering criteria should not be excessive, so that the clustering solution can
be meaningfully interpreted and used.
Clustering
In K-means clustering, “K” refers to the number of clusters and “means” refers to the cluster centroids
(i.e., the centre or average of all the observations within a cluster).
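A minimal sketch of the K-means idea follows. It assumes two-dimensional points, Euclidean distance and hand-picked starting centroids; the data and the function name `k_means` are hypothetical. Each pass assigns every observation to its nearest centroid and then moves each centroid to the mean of its cluster.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def k_means(points: List[Point], centroids: List[Point],
            iterations: int = 10) -> Tuple[List[Point], List[List[Point]]]:
    """Plain K-means where K is the number of starting centroids supplied."""
    clusters: List[List[Point]] = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical observations forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (9.0, 9.0)])

# Share of the dataset in the smallest cluster, for a cluster-size rule-of-thumb check.
smallest_share = min(len(c) for c in clusters) / len(points)
print(centroids, smallest_share)
```

In practice the starting centroids are usually chosen at random and the algorithm is rerun several times, since the result can depend on the starting points.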
Predictive Modelling
• Classification refers to the prediction of a target that is categorical in nature. Examples of classification
include predicting fraud versus non-fraud, high-risk versus low-risk or purchase versus non-purchase.
• Estimation, on the other hand, refers to the prediction of a target that is quantitative (i.e., continuous) in
nature (e.g., predicting the amount spent, duration of a telephone call or account balance).
• Predictive modelling attempts to predict a target (also called a response variable or a dependent variable) on
the basis of one or more inputs (also called predictor variables or independent variables).
Predictive Modelling
• Three data mining tools are commonly used for predictive modelling, namely, regression,
• There is no one best data mining tool for predictive modelling as each of these models has its
• Banking: Credit scoring, an application of predictive modelling, is widely used in banks to aid
decisions on whether to offer credit to their customers.
• Retail: Retail shops use predictive modelling to aid their up-selling and cross-selling efforts.
They send mailers to customers who are likely to respond to up-selling or cross-selling campaigns
so as to achieve higher profit with lower marketing cost.
Predictive Modelling Methodology
Source: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/1200px-CRISP-DM_Process_Diagram.png
Predictive Modelling: Decision Trees
• A decision tree comprises a series of decision points, where the entire set of observations or subset of these
observations is split on the basis of some splitting criteria.
• Classification Tree
Able to predict the values of a categorical target variable, given a set of inputs
• Regression Tree
Able to estimate the values of a continuous target variable, given a set of inputs
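To make the idea of a splitting criterion concrete, the sketch below picks the best threshold on a single numeric input for a two-class target. It assumes Gini impurity as the criterion, which is only one common choice (other tree algorithms use different criteria, such as chi-square tests). The incomes, buyer flags and function names are hypothetical.

```python
from typing import List, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 class labels (0.0 means a pure node)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, weighted impurity) of the best 'value >= threshold' split."""
    best = (float("nan"), float("inf"))
    n = len(values)
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        if not left or not right:
            continue  # a split must send observations to both branches
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical incomes (in $'000) and buyer flags (1 = buyer, 0 = non-buyer).
incomes = [40, 55, 70, 100, 120, 150]
buyers = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(incomes, buyers)
print(threshold, impurity)
```

A full tree repeats this search on each resulting subset, across all inputs, until a stopping rule is met.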
Predictive Modelling: Decision Trees
What are the rules for the buyers?
Node 8: Income >= $100,000; Male; Race is Malay or Indian (85%)
Predictive Modelling: Evaluation
Performance Measure
Accuracy for negative cases = p / (p + q)
Hit rate for negative cases = p / (p + u)
Predictive Modelling: Evaluation
Simple Train-and-Test Evaluation Method
• Partition the data into training data and testing data. Construct the model
based on the training data.
• Test the model using the testing data. The testing data serves as a data set that
the model has not seen before.
• The Simple Train-and-Test approach allows analysts to infer how well the model
would perform on unseen data; that is, observations that have not been seen by
the model.
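The train-and-test procedure above can be sketched as follows. Everything here is an assumption made for illustration: the records are made up, the 70/30 split and the seed are arbitrary, and the "model" is a deliberately simple cut-off placed midway between the two class means rather than any of the tools discussed in the seminar.

```python
import random
from typing import List, Tuple

Record = Tuple[float, int]  # (input value, target label 0 or 1)


def train_test_split(data: List[Record], test_fraction: float = 0.3,
                     seed: int = 42) -> Tuple[List[Record], List[Record]]:
    """Shuffle a copy of the data and hold out a fraction for testing."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train: List[Record]) -> float:
    """Toy model: the cut-off is the midpoint between the two class means."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold: float, records: List[Record]) -> float:
    """Share of records whose predicted label (x >= threshold) matches the target."""
    correct = sum(1 for x, y in records if (1 if x >= threshold else 0) == y)
    return correct / len(records)


# Hypothetical data: low values belong to class 0, high values to class 1.
data: List[Record] = ([(float(v), 0) for v in range(1, 11)]
                      + [(float(v), 1) for v in range(101, 111)])

train, test = train_test_split(data)
model = fit_threshold(train)          # built on the training data only
test_accuracy = accuracy(model, test)  # measured on data the model has not seen
print(len(train), len(test), test_accuracy)
```

The key design point is that `fit_threshold` never sees the testing records, so `test_accuracy` estimates performance on unseen data.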
Predictive Modelling: Evaluation
Performance Measure
                  Predicted Response
Actual Response   Yes    No     Total
Yes               65     35     100
No                30     70     100
Total             95     105    200
• Overall accuracy rate of the predictive model is 67.5% (i.e., (65 + 70)/200)
• Overall error rate is 32.5% (i.e., 1 – (65 + 70)/200)
• Hit rate for “Yes” is 68.4% (i.e., 65/95).
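The arithmetic above can be reproduced in a few lines. The sketch assumes that the rows of the table are the actual responses and the columns are the predicted responses, which is consistent with the hit rate being computed on the 95 predicted “Yes” cases; the variable names are illustrative.

```python
# Counts from the table: outer key = actual response, inner key = predicted response.
matrix = {
    "Yes": {"Yes": 65, "No": 35},
    "No": {"Yes": 30, "No": 70},
}

total = sum(sum(row.values()) for row in matrix.values())
correct = matrix["Yes"]["Yes"] + matrix["No"]["No"]

accuracy = correct / total                 # (65 + 70) / 200
error_rate = 1 - accuracy                  # 1 - (65 + 70) / 200
predicted_yes = matrix["Yes"]["Yes"] + matrix["No"]["Yes"]
hit_rate_yes = matrix["Yes"]["Yes"] / predicted_yes  # 65 / 95

print(f"accuracy={accuracy:.1%} error={error_rate:.1%} hit rate (Yes)={hit_rate_yes:.1%}")
```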
Pre-Requisites and Limitations
The state of the data, the coordination among different departments and
the complexity of a data mining project are important determinants of the
duration of the data mining project.
The data mining team should possess the following knowledge and skills:
domain knowledge, data mining knowledge and skills, IT expertise, and
statistical and research expertise.
• Some patterns, trends and relationships found in data mining may not be useful.
• Data mining is well developed for modelling (i.e., prediction) but it is not as well
developed for effect assessment.
New York City Fights Fire with Data
Based on the article, what do you think is
1. Business issue
• Rising volume of building inspection requirements.
5. How has the application helped to address the business issue?
• Helped to better deploy limited resources to inspect buildings with higher fire risk scores
before a fire starts.
Source: http://www.govtech.com/public-safety/New-York-City-Fights-Fire-with-Data.html
Data Mining Demo – SPSS Modeler
Demo
• MailPurchase is a mail order company with a database of 1400 customers.
• For each customer, the following data for last year were captured.
Attribute  Description
Status     Whether the customer has purchased a promoted product in any of the quarterly marketing campaigns
Expend     Average monthly expenditure on the company’s products
Numpur     Average number of purchases per quarter
Age        Age of customer as at 1 January
Gender     Gender of customer
Income     Annual income of customer as at 1 January (in $’000)
Race       Race of customer
Marital    Marital status of customer as at 1 January
Member     Whether the customer is a member of the loyalty card programme
• Suppose that, to develop the next marketing campaign, MailPurchase wants to target
only existing customers with a high probability of purchase.
• To construct this prediction model, MailPurchase has decided to use ‘Status’ as the target
variable and the others as inputs.
• From the prediction model, MailPurchase will be able to predict the probability of purchase
and hence be able to classify existing customers into the purchaser and non-purchaser
groups.
CHAID is chosen as the final model
Node  First Variable    Second Variable  Third Variable  Fourth Variable  Fifth Variable  Probability
20    Race = 1, 2 or 4  Member = 0       Gender = 1      Marital = 3      -               0.596