
Business analytics

Lecture 3- data mining


Data mining
▪Ever-changing customer needs
▪Untapped value hidden in large
databases
▪Single view of customers, vendors,
partners, transactions
▪Exponential increase in data
processing technologies and storage
capabilities
▪Movement towards demassification
Data mining
▪The nontrivial process of identifying
valid, novel, potentially useful, and
ultimately understandable patterns in
data stored in structured databases.
- Fayyad et al. (1996)
▪Keywords in this definition: Process,
nontrivial, valid, novel, potentially
useful, understandable patterns.

Data mining
▪Data mining helps end-users extract
useful business information from large
databases. It is a process of discovering
new knowledge from databases
▪Data mining promises to fix the problem
of miscommunication between the user and
the data, and allows the user to ask complex
questions to navigate and visualize
data easily
Data mining
▪DM extracts patterns from data
▪Pattern? A mathematical (numeric and/or
symbolic) relationship among data items

▪Types of patterns
▪Association
▪Prediction
▪Cluster (segmentation)
▪Sequential (or time series) relationships

Taxonomy of Data mining

Data
▪Structured
  ▪Categorical: nominal, ordinal
  ▪Numerical: interval, ratio
▪Unstructured or semi-structured
  ▪Textual
  ▪Multimedia
  ▪HTML/XML
Data mining
▪Data mining is a
▪Process
▪Nontrivial: requires experimentation
▪Valid: patterns hold true for new
data
▪Novel: previously unknown patterns
▪Potentially useful: business benefits
▪Understandable: makes business sense
Data mining & Statistics
▪Despite the overlapping usage and
even technology, data mining is
bringing something new to the party –
namely an easy way for business and
user professionals to access the
power of statistics
▪The real opportunity provided by data
mining: it empowers the end user,
much as the Excel spreadsheet does
Data mining
▪The main difference between data mining
and statistics is that data mining is
meant to be used by business end
users, not statisticians.
▪Statistics starts with a well-defined
proposition/hypothesis. Data mining
starts with a loosely defined discovery
statement.
▪Data mining effectively automates the
statistical process, thereby relieving the
end user of some burden
Data mining
▪Data mining has proved to be a powerful
force in creating a new functional tool for
business end-users, like the spreadsheet.
▪To achieve this, data mining must be
deployed as an embedded technology
within the data warehouse.
▪When this is done it ensures
improvements in accuracy, speed and
cost.
Data mining
▪Data mining is the process of
analyzing data to extract information
▪It can begin at the summary
information level (coarse granularity)
and progress through increasing
levels of detail.
Data mining applications
▪Customer relationship management
▪Banking
▪Retail and logistics
▪Manufacturing and production
▪Brokerage and securities trading
▪Insurance
▪Travel
▪Healthcare, medicine
▪Security
▪Sports
Data mining
▪ Business benefits of BI
▪Single-point access to information
for all users
▪BI across organizational
departments
▪Up-to-the-minute information for
everyone

Data mining
▪ Categories of BA benefits
▪Quantifiable benefits: time saved in
producing reports
▪Indirectly quantifiable benefits:
improved customer service leading
to higher sales
▪Unpredictable benefits
▪Intangible benefits: better
communication
Data mining
▪Data mining works the same way
humans do. It uses historical information
(experience) to learn
▪But you have to tell the data mining
tools what the "GOLD" looks like
▪A data mining tool can automatically go
through the entire database and find
even the smallest pattern which may
help in better prediction
Discovery versus Prediction
▪ Data mining includes prediction as one
of its benefits. It not only analyses the
mountain of data but can also tell you
what the mountain will look like next
month
▪Data mining helps in understanding the
tectonics of mountains – how they are
changing and moving – not just what
they look like today.
▪These features are called discovery
Discovery versus Prediction
▪ Very often the users themselves do not
know what the gold looks like within the
model of data.
▪A data mining system works by making
three measurements of an association:
(1) how strong, (2) how unexpected and
(3) how ubiquitous it is.
Discovery versus prediction
▪In prediction, the end user has a very specific
event or attribute in mind and wants to find
patterns associated with it.
▪Learning can be supervised or
unsupervised
▪Supervised learning involves building a
model for the specific purpose of optimally
predicting some target within the historical
database
▪Unsupervised learning does not have any
well-defined goal or target to predict.
▪How to use WEKA software for data
mining tasks
▪https://www.youtube.com/watch?v=U63ExiTJMic&t=21s
▪KNIME: A Tool for Data Mining
▪https://www.youtube.com/watch?v=V1KbJL4uutQ
Discovery versus prediction
▪ Predictions are largely experience
and opinion based whereas
forecasting is data and model based.

▪Prediction in data mining includes
classification and regression analysis
Data mining Process
Cross-industry standard process for
data mining (CRISP-DM)
▪Business understanding
▪Data understanding
▪Data preparation: consolidation,
cleaning, transformation, reduction
▪Model building
▪Testing and evaluation
▪Deployment
(Note: the early steps take 85% of project time.)
Data Preparation – A critical task

▪Data consolidation: collect data, select data,
integrate data
▪Data cleaning: impute missing values, reduce
noise in data, eliminate inconsistencies
▪Data transformation: normalise data, aggregate
data, construct new attributes
▪Data reduction: reduce number of variables,
reduce number of cases, balance skewed data
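Two of the preparation tasks listed above, imputing missing values and normalising data, can be sketched in a few lines. This is only an illustration under assumptions not found in the slides: missing values are filled with the column mean, normalisation is min-max scaling to [0, 1], and the `ages` column is hypothetical.

```python
from statistics import mean


def impute_missing(values):
    """Replace None entries with the mean of the observed values (one simple choice)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalise(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column of customer ages with one missing entry.
ages = [20, None, 40, 60]
cleaned = impute_missing(ages)        # the None is replaced by the mean of 20, 40, 60
scaled = min_max_normalise(cleaned)   # smallest value maps to 0.0, largest to 1.0
print(cleaned)
print(scaled)
```

Mean imputation and min-max scaling are among the simplest options; other choices (median imputation, z-score scaling) follow the same pattern.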
Data mining Process
▪SEMMA: Sample, Explore, Modify,
Model, Assess
▪KDD: data selection, data
preprocessing, data transformation,
data mining, interpretation/evaluation
Data mining effectiveness
▪Data mining effectiveness is measured by
▪ Accuracy,
▪Speed and
▪Cost
▪Often the accuracy of predictions depends more
on the correct deployment of the technology and
the quality of data than on the technology itself
▪Choice of data mining tools should be driven
by the advantages they bring to the bottom line
of the business process and not just by
statistical accuracy
Data mining effectiveness
▪ Data mining is often measured by
speed. The faster the tool runs, the
larger the dataset to which it can be
applied.
▪Data analysis must be
▪Embedded into the data warehouse
▪Understandable and usable by the
user department professionals
Data mining effectiveness
In classification problems, the primary source
for accuracy estimation is the confusion
matrix.

                     True class: Positive         True class: Negative
Predicted Positive   True Positive Count (TP)     False Positive Count (FP)
Predicted Negative   False Negative Count (FN)    True Negative Count (TN)

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{True Positive Rate} = \frac{TP}{TP + FN}$$

$$\text{True Negative Rate} = \frac{TN}{TN + FP}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$
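The confusion-matrix measures on this slide can be computed directly from the four cell counts. The sketch below is a minimal illustration; the function name and the example counts (40, 10, 20, 30) are made up and do not come from the lecture.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the confusion-matrix measures from the four cell counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),  # identical to recall
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts for a binary classifier evaluated on 100 cases.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall and true positive rate are the same quantity, so they always agree. Precision differs because its denominator counts predicted positives rather than actual positives.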
Data mining software products
▪ Targeted solutions – apply the power of
data mining to a particular problem or
industry (industry-specific solutions)
▪Business tools – targeted towards
business end-users; easy to use and
understandable, so that a business user
can get value from the tool
Data mining software products
▪ Business analyst tools – for users of
business applications, where the user
building the tool has some sense of
how data mining works and what
some of the different variations
accomplish
▪Research analyst tools – targeted
towards data mining researchers and
statistical analysts
Evaluation of tools
▪ Measures for evaluation of tools are
▪Automation – is the technique
automated and easy to use
▪Clarity – are the results clear and
understandable
▪ROI – is the tool useful for achieving
bottom-line results and improving the
return on investment
Evaluation of tools

▪ Clusters – can the technique easily be used to
perform clustering on input data
▪ Links – can the technique be used to find links
between records
▪ Outliers – can the technique be used to detect
abnormal records
▪ Rules – can the technique efficiently create
rules in the database to make predictions
▪ Text – can the technique be used for clustering
and prediction of textual information
▪ Sequences – can the technique be used for
time series or sequential data prediction
Evaluation of tools
▪ Accuracy – how accurate is the dominant
technology at getting the right answer
▪Clarity
▪Dirty data – can it handle data with missing
values and errors
▪Raw data – does it require a lot of
preprocessing
▪RDBMS – can it be embedded directly in
an RDBMS
▪Scalability
▪Speed
▪Validation
Data mining algorithms
▪Classification: learns patterns from past
data with previously labelled items in order to
place new (unlabelled) instances into
their groups, e.g. credit approval (good,
bad), weather (sunny, cloudy)
▪Analyses historical data and automatically
generates a model which can predict
future behaviour
▪The process involves model development/
training and model testing/deployment
Data mining algorithms
▪Factors for assessing Classification-
▪Predictive accuracy
▪Speed
▪Robustness
▪Scalability
▪Interpretability

Cluster Analysis for Data Mining
▪Used for automatic identification of
natural groupings of things
▪Part of the machine-learning family
▪Employs unsupervised learning
▪Learns the clusters of things from past
data, then assigns new instances
▪There is no output variable
▪Also known as segmentation
Data mining algorithms
▪k-fold cross-validation
▪The complete data set is randomly split
into k folds (mutually exclusive
subsets of approximately equal size)
▪The model is trained and tested k
times, each time trained on k-1 folds
and tested on the remaining fold
▪Overall accuracy is the average of the k
fold accuracies:
$$\text{Overall accuracy} = \frac{1}{k}\sum_{i=1}^{k} A_i$$
where $A_i$ is the accuracy measure on fold $i$
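The k-fold procedure above can be sketched in plain Python. This is an illustration under stated assumptions rather than a reference implementation: the "model" is a majority-class baseline (it predicts the most frequent training label), and the `labels` data and the function names are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k disjoint, near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(labels, k, seed=0):
    """Mean accuracy of a majority-class baseline over k folds."""
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for test_fold in folds:
        held_out = set(test_fold)
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        # "Training": remember the most frequent label in the other k-1 folds.
        majority = Counter(train).most_common(1)[0][0]
        # "Testing": score that single prediction on the held-out fold.
        correct = sum(1 for i in test_fold if labels[i] == majority)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / k  # overall accuracy = (1/k) * sum of fold accuracies


# Hypothetical credit-approval labels: 8 good, 2 bad.
labels = ["good"] * 8 + ["bad"] * 2
overall = cross_validate(labels, k=5)
print(overall)
```

Every record is used for testing exactly once and for training k-1 times, which is why the averaged figure is less dependent on one lucky or unlucky split.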
Cluster Analysis for Data Mining
▪k-Means Clustering Algorithm
▪k : pre-determined number of clusters
▪Algorithm (Step 0: determine the value of k)
Step 1: Randomly generate k points as initial
cluster centers.
Step 2: Assign each point to the nearest cluster
center.
Step 3: Re-compute the new cluster centers.
Repetition step: Repeat steps 2 and 3 until some
convergence criterion is met (usually that the
assignment of points to clusters becomes stable).
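The k-means steps above can be sketched as follows. The sketch makes assumptions the slide does not state: points are 2-D tuples, distance is squared Euclidean, and the initial centers are drawn from the data points themselves (a common variant of "randomly generate k points"). The sample `points` are hypothetical.

```python
import random


def k_means(points, k, max_iter=100, seed=0):
    """Cluster 2-D points into k groups; returns (centers, assignment)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        new_assignment = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Convergence criterion: the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centers, assignment


# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(points, k=2)
print(centers)
print(assignment)
```

On data this well separated the two groups are recovered whatever the starting centers; on real data the result can depend on initialization, so running the algorithm several times is common.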
Data mining algorithms
▪Association rule mining
▪retaining existing customers
▪enhancing quality of customer
experience
▪converting the into sales
▪Association rule mining is also referred
to as market basket analysis or affinity
analysis
Data mining algorithms
▪Association rule mining
▪ Support for an itemset is defined
as: support(milk, diapers) = number
of transactions containing (milk,
diapers) / total number of transactions
▪Confidence of (milk, diapers) -> (beer)
is equal to support for
(milk, diapers) -> (beer) / support for
(milk, diapers), where the support of a
rule counts the transactions that
contain all of its items
Association Rule Mining
▪Are all association rules interesting and useful?
A Generic Rule: X -> Y [S%, C%]

X, Y: products and/or services
X: Left-hand-side (LHS)
Y: Right-hand-side (RHS)
S: Support: how often X and Y go together
C: Confidence: how often Y goes together with X

Example: {Laptop Computer, Antivirus Software} ->
{Extended Service Plan} [30%, 70%]
Data mining algorithms
Trans ID  Bread  Milk  Eggs  Diapers  Beer
1         1      1     0     0        0
2         1      1     1     1        1
3         1      1     0     1        1
4         0      0     0     1        1
5         1      1     1     1        0
6         1      1     0     1        1
▪Support for (milk, diapers) -> beer = 3/6 = 0.5
▪Support for (milk, diapers) = 4/6 = 0.67
▪Confidence for (milk, diapers) -> beer =
support for (milk, diapers) -> beer / support
for (milk, diapers) = (3/6) / (4/6) = 3/4 = 0.75
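The support and confidence figures for this table can be checked by counting transactions directly. The sketch below encodes the six rows as Python sets; the function names are illustrative only. Counting gives a confidence of exactly 3/4 = 0.75, since three of the four transactions containing milk and diapers also contain beer.

```python
# The six transactions from the table, one set of purchased items per row.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"diapers", "beer"},
    {"bread", "milk", "eggs", "diapers"},
    {"bread", "milk", "diapers", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(lhs and rhs together) / support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)


rule_support = support({"milk", "diapers", "beer"}, transactions)
lhs_support = support({"milk", "diapers"}, transactions)
rule_confidence = confidence({"milk", "diapers"}, {"beer"}, transactions)
print(rule_support, lhs_support, rule_confidence)
```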
Association Rule Mining
Apriori Algorithm
Raw transaction data (SKUs / item numbers, one transaction per row)
1, 2, 3, 4
2, 3, 4
2, 3
1, 2, 4
1, 2, 3, 4
2, 4

One-item itemsets
Itemset (SKUs)  Support
1               3
2               6
3               4
4               5

Two-item itemsets
Itemset (SKUs)  Support
1, 2            3
1, 3            2
1, 4            3
2, 3            4
2, 4            5
3, 4            3

Three-item itemsets
Itemset (SKUs)  Support
1, 2, 4         3
2, 3, 4         3
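The level-by-level counting in the Apriori tables can be reproduced with a short sketch. It assumes a minimum support count of 3, which the slide does not state but which is consistent with {1, 3} (support 2) not appearing in any three-item itemset. The function name and structure are illustrative, not a library API.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for all itemsets with count >= min_count."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # one-item candidates
    frequent = {}
    size = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        size += 1
        # Join frequent itemsets into candidates one item larger ...
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # ... and prune any candidate that has an infrequent subset.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent


# Raw transaction data from the slide (SKUs per transaction).
transactions = [{1, 2, 3, 4}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {2, 4}]
frequent = apriori(transactions, min_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The pruning step is the core of Apriori: {1, 2, 3} is never counted because its subset {1, 3} already failed the threshold.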
Data mining algorithms
▪Association rule mining
▪Confidence is a measure of the reliability
of the inference made by an association rule
▪Support may happen by chance
▪Lift(X->Y) = support(X->Y) /
(support(X) * support(Y))
▪ Lift is the ratio of the confidence of the
rule to the expected confidence of the
rule
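As a worked illustration of lift, the sketch below reuses the bread/milk/eggs/diapers/beer transaction table from the earlier slide and computes lift for (milk, diapers) -> beer in both forms given above. The function names are illustrative; expected confidence is taken to be support(Y), the usual reading of the term.

```python
# Transactions from the bread/milk/eggs/diapers/beer table on the earlier slide.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"diapers", "beer"},
    {"bread", "milk", "eggs", "diapers"},
    {"bread", "milk", "diapers", "beer"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)


def lift(lhs, rhs):
    """support(X and Y together) / (support(X) * support(Y))."""
    return support(set(lhs) | set(rhs)) / (support(lhs) * support(rhs))


x, y = {"milk", "diapers"}, {"beer"}
rule_lift = lift(x, y)

# Equivalent form: confidence of the rule divided by its expected confidence.
rule_confidence = support(x | y) / support(x)
expected_confidence = support(y)
print(rule_lift, rule_confidence / expected_confidence)
```

Both forms give about 1.125. A lift above 1 means beer appears with milk and diapers more often than it would if the two were independent.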
Data mining algorithms
▪Decision tree: a decision support tool.
Uses a treelike graph to depict decisions
and their consequences.
▪It has decision nodes represented by
squares, chance nodes represented by
circles and end nodes represented by
triangles
▪Easy to interpret (it is a plot) and easy to
couple with other decision techniques
Data mining algorithms
▪Decision tree: the algorithm recursively
divides the training set until each
division consists entirely or primarily
of examples of only one class
▪Select the best splitting attribute, split
the data into exclusive (non-overlapping)
subsets, repeat the
process
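The "select the best splitting attribute" step can be sketched for categorical attributes. The slides do not name a splitting criterion, so the sketch assumes Gini impurity, one common choice; the credit-approval rows and attribute names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when all labels belong to one class."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (attribute, weighted Gini) for the attribute giving the purest subsets."""
    best_attr, best_score = None, float("inf")
    for attr in rows[0]:
        # Split the labels into exclusive subsets, one per attribute value.
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        # Weight each subset's impurity by its share of the records.
        score = sum(len(g) / len(labels) * gini(g) for g in groups.values())
        if score < best_score:
            best_attr, best_score = attr, score
    return best_attr, best_score


# Hypothetical credit-approval records.
rows = [
    {"income": "high", "owns_home": "yes"},
    {"income": "high", "owns_home": "no"},
    {"income": "low", "owns_home": "yes"},
    {"income": "low", "owns_home": "no"},
]
labels = ["good", "good", "bad", "bad"]
chosen = best_split(rows, labels)
print(chosen)
```

A full tree builder would apply `best_split` recursively to each subset until a stopping criterion is met, for example when a subset is pure or too small to split.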
Decision Trees
▪DT algorithms mainly differ on
1. Splitting criteria
▪ Which variable, what value, etc.
2. Stopping criteria
▪ When to stop building the tree
3. Pruning (generalization method)
▪ Pre-pruning versus post-pruning
Data mining algorithms
▪Cluster analysis: an exploratory data
analysis tool for solving classification
problems
▪Rule of thumb for the number of clusters:
k ≈ (n/2)^(1/2), where n is the number of
data points
▪K-means clustering: it aims to partition
observations into K clusters in which
each observation belongs to the cluster
with the nearest mean, which serves as a
prototype of the cluster
▪ Classification is supervised learning,
clustering is unsupervised learning
Data mining algorithms
▪Data mining is a multi-step process
requiring deliberate, proactive design
and use.
▪Data mining does not require
a dedicated database
▪Web-based tools enable managers to
do data mining
▪ Hands-On Guide To Market Basket Analysis With Python Codes
▪ https://analyticsindiamag.com/hands-on-guide-to-market-basket-analysis-with-python-codes/
▪ MBA For Breakfast — A Simple Guide to Market Basket Analysis
▪ https://towardsdatascience.com/mba-for-breakfast-4c18164ef82b

▪ Market Basket Analysis (Apriori) in Python
▪ https://www.kaggle.com/yugagrawal95/market-basket-analysis-apriori-in-python
▪ Evolution of movies over the years
▪ https://www.kaggle.com/stephanerappeneau/evolution-of-movies-over-the-years

▪ Introduction to Decision Trees (Titanic dataset)
▪ https://www.kaggle.com/dmilla/introduction-to-decision-trees-titanic-dataset
