net/publication/343163746
All content following this page was uploaded by Abhishek Dutta on 23 July 2020.
Abhishek Dutta
(Reg. No: 182341810018)
Under the Supervision of Prof. Jaydip Sen
NSHM Knowledge Campus, Kolkata, INDIA
Affiliated to
Maulana Abul Kalam Azad University of Technology, Kolkata, INDIA
• The project analyses two real-world datasets for market basket analysis. The
first dataset has been analysed using the Python programming language
and the second dataset using the R programming language.
• Association rule mining also has other applications such as web usage mining,
intrusion detection, bioinformatics etc.
IF A THEN B, written A → B
• Of the two elements, A is the ‘antecedent’ (the IF part)
and B is the ‘consequent’ (the THEN part).
22-07-2020 MSc (Data Science & Analytics) Mini Project Presentation: Batch 2018 - 2020
Association Rule Mining
Consider the following store data set:
Transaction ID   Item 1   Item 2      Item 3      Item 4
1                Bread    BBQ sauce   -           Meat
2                Meat     Bread       BBQ sauce   Beer
3                Meat     -           -           Beer
4                Beer     BBQ sauce   Meat        -
5                Bread    Meat        Beer        BBQ sauce
We can infer that a consumer who buys Meat is also likely to buy Beer. Thus,
Meat → Beer
Similarly, people who buy Bread and Meat tend to also buy Barbeque sauce and
Beer:
IF {Meat, Bread} THEN {Beer, BBQ sauce}
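To make the two rules concrete, the sketch below counts how often each one holds in the five transactions of the store table. The set-based encoding and the helper name `rule_counts` are illustrative assumptions, not code from the project.

```python
# Transactions from the store table; "-" cells are simply omitted.
store = [
    {"Bread", "BBQ sauce", "Meat"},
    {"Meat", "Bread", "BBQ sauce", "Beer"},
    {"Meat", "Beer"},
    {"Beer", "BBQ sauce", "Meat"},
    {"Bread", "Meat", "Beer", "BBQ sauce"},
]

def rule_counts(transactions, antecedent, consequent):
    """Return (number of transactions containing the antecedent,
    number of those that also contain the consequent)."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_antecedent), len(with_both)

meat_beer = rule_counts(store, {"Meat"}, {"Beer"})
bread_meat_rule = rule_counts(store, {"Meat", "Bread"}, {"Beer", "BBQ sauce"})

print(meat_beer)        # (5, 4)
print(bread_meat_rule)  # (3, 2)
```

Meat appears in all five baskets and Beer accompanies it in four of them; Bread and Meat appear together in three baskets, two of which also contain Beer and BBQ sauce.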
Market Basket Analysis
• Association Rule Mining is applied in marketing, in Market Basket Analysis
(also known as Basket Data Analysis) in retailing, and in clustering and
classification.
• Association rules capture relationships between the sets of items in each
transaction, not between individual features. Hence, the approach differs from
collaborative filtering.
Metrics of Association Rules - Support
Support – Support tells us how frequently a particular itemset appears in a dataset.
Thus, if we consider an item ‘A’ as antecedent and ‘B’ as consequent, then the
support of A→B is the ratio of the number of transactions containing both ‘A’ and
‘B’ to the total number of transactions.
For example, if 50.0% of the transactions contain both milk and cookies, the
support of {milk} → {cookies} is 0.5. Thus, support is the joint probability
P(A ∩ B), not the conditional probability P(A|B) or P(B|A). It is also known as
coverage.
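As a minimal sketch of this definition, the snippet below computes support on the six-transaction milk / cookies / dry fruits / flour table used in these slides. The helper name `support` and the set-based encoding are illustrative assumptions.

```python
# Six transactions from the milk / cookies / dry fruits / flour table.
transactions = [
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "dry fruits", "flour"},
    {"dry fruits", "flour"},
    {"cookies"},
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "cookies", "flour"},
]

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

support_milk_cookies = support(transactions, {"milk", "cookies"})
print(support_milk_cookies)  # 0.5
```

Three of the six transactions contain both milk and cookies, which gives the 50.0% quoted above.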
• Support helps us to identify the rules that are worth considering for
analysis. The example discussed above is hypothetical; in reality the situation
is not that simple. A departmental store, for example, can have 10000
transactions, and in that case our support falls to 0.0005.
• For items with low support values, we cannot be sure about the relationship
between them due to lack of information, and we cannot draw a conclusion.
• For example, a team scoring 5 goals in a match has a very low chance of
losing the match. However, the chance of scoring 5 goals in a match is also
very low.
Metrics of Association Rules - Confidence
Transaction ID   Milk   Cookies   Dry Fruits   Flour
1                1      1         1            1
2                1      0         1            1
3                0      0         1            1
4                0      1         0            0
5                1      1         1            1
6                1      1         0            1
• Thus, confidence {dry fruits} → {flour} = 4/4 = 1.0, since all four
transactions containing dry fruits also contain flour.
confidence {flour} → {dry fruits} = 4/5 = 0.8.
Hence, {dry fruits} → {flour} is the stronger rule for this example.
• Hence, we can say that if A (antecedent) and B (consequent) are two items,
then confidence is P(A ∩ B)/P(A), which is the conditional probability P(B|A).
• However, the antecedent for a frequent consequent does not matter much, as
the confidence for an association rule having a very frequent consequent will
always be high.
Toothpaste only: 2 | Toothpaste and Bread: 10 | Bread only: 50
• Here, intuitively we know that toothpaste and bread have weak association,
but, {toothpaste} → {bread} has a confidence of 10/12 = 0.833 which is quite
high. This may lead to misinformation and hence we will later discuss lift as a
more robust metric.
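The confidence values above can be checked with a short sketch. It reuses the six-transaction table and, for the toothpaste example, only the counts implied by the 10/12 in the text (10 baskets with both items, 2 with toothpaste but no bread). The helper name `confidence` is an illustrative assumption.

```python
# Six transactions from the milk / cookies / dry fruits / flour table.
transactions = [
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "dry fruits", "flour"},
    {"dry fruits", "flour"},
    {"cookies"},
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "cookies", "flour"},
]

def confidence(antecedent, consequent):
    """P(consequent | antecedent), estimated from transaction counts."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    return n_both / n_antecedent

conf_dry_to_flour = confidence({"dry fruits"}, {"flour"})
conf_flour_to_dry = confidence({"flour"}, {"dry fruits"})
print(conf_dry_to_flour)  # 1.0
print(conf_flour_to_dry)  # 0.8

# Toothpaste example: 10 baskets with both items, 2 with toothpaste only.
both, toothpaste_only = 10, 2
conf_toothpaste_to_bread = both / (both + toothpaste_only)
print(round(conf_toothpaste_to_bread, 3))  # 0.833
```

The toothpaste rule scores high only because bread is in almost every basket, which is the weakness of confidence that lift is meant to correct.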
Metrics of Association Rules - Conviction
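$$\mathrm{Conviction}(A \rightarrow B) = \frac{1 - \mathrm{support}(B)}{1 - \mathrm{confidence}(A \rightarrow B)}$$
For {milk} → {cookies}, with support({cookies}) = 0.66 and confidence({milk} → {cookies}) = 0.75, then
$$\frac{1 - 0.66}{1 - 0.75} = 1.36$$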
• The conviction value of 1.36 shows that the rule {milk} → {cookies} would
make a wrong prediction 36% more often (1.36 times as often) if the association
between these two items were purely random.
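The arithmetic can be reproduced with a small sketch. The function name `conviction` is an illustrative assumption. The first call uses the rounded inputs quoted on the slide (0.66 and 0.75), and the second uses the unrounded fractions from the six-transaction table (4/6 and 3/4) to show how much the rounding matters.

```python
def conviction(support_consequent, confidence_rule):
    """(1 - support(B)) / (1 - confidence(A -> B)).

    A rule with confidence 1 never fails, so conviction is reported as infinity.
    """
    if confidence_rule == 1:
        return float("inf")
    return (1 - support_consequent) / (1 - confidence_rule)

conviction_rounded = conviction(0.66, 0.75)    # inputs as quoted on the slide
conviction_exact = conviction(4 / 6, 3 / 4)    # unrounded table fractions

print(round(conviction_rounded, 2))  # 1.36
print(round(conviction_exact, 2))    # 1.33
```

The difference between 1.36 and 1.33 comes only from rounding support({cookies}) = 4/6 down to 0.66.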
Metrics of Association Rules – All-confidence
$$\text{All-confidence}(X) = \frac{\mathrm{support}(X)}{\max_{x \in X} \mathrm{support}(x)}$$
• $\max_{x \in X} \mathrm{support}(x)$ is the support of the single item with the
highest support in the item set X. All-confidence means that every rule which
can be produced from the item set X has a confidence of at least the value
returned by this equation.
• All-confidence is null-invariant and takes values in the range [0, 1]. It can
be used efficiently with most mining algorithms.
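The following sketch evaluates all-confidence on the six-transaction table from these slides and checks the "minimum confidence" interpretation for the item set {milk, cookies}. The function names are illustrative assumptions, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

# Six transactions from the milk / cookies / dry fruits / flour table.
transactions = [
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "dry fruits", "flour"},
    {"dry fruits", "flour"},
    {"cookies"},
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "cookies", "flour"},
]

def support(itemset):
    """Exact fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in transactions), len(transactions))

def all_confidence(itemset):
    """support(X) divided by the largest single-item support within X."""
    return support(itemset) / max(support({x}) for x in itemset)

ac_milk_cookies = all_confidence({"milk", "cookies"})
ac_dry_flour = all_confidence({"dry fruits", "flour"})

# Confidence of both rules that can be formed from {milk, cookies}.
conf_milk_to_cookies = support({"milk", "cookies"}) / support({"milk"})
conf_cookies_to_milk = support({"milk", "cookies"}) / support({"cookies"})

print(ac_milk_cookies)  # 3/4
print(ac_dry_flour)     # 4/5
print(min(conf_milk_to_cookies, conf_cookies_to_milk))  # 3/4
```

For {milk, cookies} the all-confidence of 3/4 equals the smaller of the two rule confidences, as the bullet above states.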
Metrics of Association Rules – Lift
Transaction ID Milk Cookies Dry Fruits Flour
1 1 1 1 1
2 1 0 1 1
3 0 0 1 1
4 0 1 0 0
5 1 1 1 1
6 1 1 0 1
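$$\mathrm{Lift}(X \rightarrow \{Y\}) = \frac{(\text{Transactions containing both } X \text{ and } Y) \,/\, (\text{Transactions containing } X)}{\text{Frequency of } Y} = \frac{\mathrm{Confidence}(X \rightarrow \{Y\})}{\mathrm{Support}(Y)}$$
• For this case, the lift of {milk} → {cookies} is $0.75 / 0.66 = 1.136$.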
• A lift value higher than 1 implies that the probability of buying cookies
increases with the purchase of milk; in other words, the two items are
positively associated. A lift value of 1 means the two purchases are
independent, and a lift value lower than 1 implies that buying milk makes
buying cookies less likely.
• Lift values help store managers decide how to place items in the aisles:
items with higher lift values are kept close together.
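The lift calculation can be sketched as follows, again on the six-transaction table. The function names are illustrative assumptions. The slide's figure of 1.136 uses support({cookies}) rounded to 0.66, while the unrounded fraction 4/6 gives 1.125.

```python
from fractions import Fraction

# Six transactions from the milk / cookies / dry fruits / flour table.
transactions = [
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "dry fruits", "flour"},
    {"dry fruits", "flour"},
    {"cookies"},
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "cookies", "flour"},
]

def support(itemset):
    """Exact fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in transactions), len(transactions))

def lift(antecedent, consequent):
    """confidence(antecedent -> consequent) / support(consequent)."""
    confidence = support(set(antecedent) | set(consequent)) / support(antecedent)
    return confidence / support(consequent)

lift_exact = lift({"milk"}, {"cookies"})
lift_rounded = round(0.75 / 0.66, 3)   # inputs as quoted on the slide

print(float(lift_exact))  # 1.125
print(lift_rounded)       # 1.136
```

Lift is symmetric, so `lift({"cookies"}, {"milk"})` returns the same value; unlike confidence, it does not depend on the direction of the rule.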
Case Study
• We analysed a real-world dataset using the Python programming language to
find the associations among the products.
• The dataset considered for demonstrating market basket analysis is from a
departmental store (in the US) and consists of transactions over a period of
one month.
• The raw data is in comma-separated values (CSV) format (store_data.csv), with
a total of 7501 transactions and 119 unique items sold by the store.
• The most frequent items of the dataset are ‘mineral water’ = 1788, ‘eggs’ = 1348,
‘spaghetti’ = 1306, ‘french fries’ = 1282, ‘chocolate’ = 1229.
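Item counts and basket sizes like the ones reported for this dataset can be obtained with the standard library alone. The sketch below assumes one transaction per CSV row with no header line, and it runs on four made-up rows rather than on store_data.csv itself. The function name `load_transactions` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a few rows of a transaction CSV file
# (one basket per row, no header). These are NOT rows from store_data.csv.
sample_csv = """mineral water,eggs,spaghetti
eggs,french fries
mineral water,chocolate,eggs,spaghetti
mineral water
"""

def load_transactions(handle):
    """Read one transaction per CSV row, dropping empty cells and empty rows."""
    transactions = []
    for row in csv.reader(handle):
        items = [cell.strip() for cell in row if cell.strip()]
        if items:
            transactions.append(items)
    return transactions

transactions = load_transactions(io.StringIO(sample_csv))

# Number of transactions in which each item occurs.
item_counts = Counter(item for t in transactions for item in set(t))
basket_sizes = [len(t) for t in transactions]

print(len(transactions), "transactions,", len(item_counts), "unique items")
print(item_counts.most_common(3))
print("basket size: min", min(basket_sizes), "max", max(basket_sizes))
```

For a real file, the same function can be called on a handle opened with `open(path, newline="")`.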
• The minimum number of items purchased in a single transaction is 1, while
maximum is 20. Item frequency of avocado was found to be highest with a value of
0.0333 and that of baby food was found to be least with a value of 0.00453.
• We have used the Python programming language to perform the MBA using the
‘apriori’ function of the ‘apyori’ library. A similar apriori function is also
available in a different library called ‘mlxtend’. For the plots, we have used the ‘seaborn’
• The parameters for the apriori function were given as: 0.0045 should be the
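To illustrate what an Apriori-style search does, here is a minimal from-scratch sketch in plain Python. It is not the `apyori` or `mlxtend` implementation and does not reproduce the project's parameter settings. The function names, the thresholds (0.5 for support, 0.8 for confidence) and the use of the small six-transaction table instead of the store data are all illustrative assumptions.

```python
from itertools import combinations

transactions = [
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "dry fruits", "flour"},
    {"dry fruits", "flour"},
    {"cookies"},
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "cookies", "flour"},
]

def frequent_itemsets(transactions, min_support):
    """Level-wise search: return {itemset: support} for every itemset
    whose support is at least `min_support`."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and keep those that meet the threshold.
        level = {}
        for c in candidates:
            s = sum(c <= t for t in transactions) / n
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate with an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent

def generate_rules(frequent, min_confidence):
    """Split each frequent itemset into antecedent -> consequent rules."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    lift = conf / frequent[consequent]
                    rules.append((set(antecedent), set(consequent), sup, conf, lift))
    return rules

frequent = frequent_itemsets(transactions, min_support=0.5)
rules = generate_rules(frequent, min_confidence=0.8)
for antecedent, consequent, sup, conf, lift in rules:
    print(sorted(antecedent), "->", sorted(consequent),
          round(sup, 3), round(conf, 3), round(lift, 3))
```

The pruning step is what makes the search tractable: an itemset is only counted if all of its smaller subsets were already found to be frequent.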
The most frequently sold items in the data were:
• Fig 2 plots the support values of the 48 association rules that were formed.
{herb, pepper}→{ground beef} has the maximum support, with a value of 0.016.
• Fig 3 plots the confidence values of the association rules. {cooking
oil, ground beef}→{spaghetti} has a value of 0.5714, which is the maximum.
Figure 2 Figure 3
• Fig 4 plots the confidence values of the association rules, sorted in
descending order, to show the confidence value of each of the rules.
• Fig 5 plots the lift values of the 48 association rules so formed.
The {light cream}→{chicken} rule has the highest value, 4.8439.
Figure 4 Figure 5
• Fig 6 shows the plotting of the lift values of the association rules, sorted in
descending order. For our case, lift is the most important metric of association
for market basket analysis.
Figure 6
The top 5 rules with highest lift values (sorted in descending order) are :
Since confidence values alone are not a good measure of the strength of an
association, plotting lift and confidence together gives a better picture.
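Ranking rules by lift while keeping confidence alongside can be sketched as below. The example uses the six-transaction milk / cookies / dry fruits / flour table rather than the 48 rules of the case study, restricts itself to single-item rules, and uses illustrative names throughout.

```python
from fractions import Fraction
from itertools import permutations

transactions = [
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "dry fruits", "flour"},
    {"dry fruits", "flour"},
    {"cookies"},
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "cookies", "flour"},
]

def support(itemset):
    """Exact fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in transactions), len(transactions))

# Build every single-item rule a -> b with its confidence and lift.
items = sorted(set().union(*transactions))
rules = []
for a, b in permutations(items, 2):
    conf = support({a, b}) / support({a})
    rules.append({"rule": f"{a} -> {b}", "confidence": conf,
                  "lift": conf / support({b})})

# Sort by lift first, breaking ties by confidence, both descending.
by_lift = sorted(rules, key=lambda r: (r["lift"], r["confidence"]), reverse=True)
for r in by_lift[:5]:
    print(r["rule"], float(r["confidence"]), float(r["lift"]))

cookies_flour = next(r for r in rules if r["rule"] == "cookies -> flour")
print(float(cookies_flour["confidence"]), float(cookies_flour["lift"]))  # 0.75 0.9
```

The rule cookies -> flour has a fairly high confidence of 0.75 but a lift below 1, which shows why the two metrics are best read together.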
Lift vs. Confidence
Conclusion
• In this project, we have discussed association rule mining
and its application to market basket analysis.