Presentation · July 2020
DOI: 10.13140/RG.2.2.32305.61286
Authors: Abhishek Dutta (NSHM Knowledge Campus), Jaydip Sen (Praxis Business School)


Association Rule Mining for Market
Basket Analysis

Master of Science (Data Science & Analytics) Batch 2018 –2020


Minor Project Presentation

Abhishek Dutta
(Reg. No: 182341810018)
Under the Supervision of Prof. Jaydip Sen
NSHM Knowledge Campus, Kolkata, INDIA
Affiliated to
Maulana Abul Kalam Azad University of Technology, Kolkata, INDIA

22-07-2020 · MSc (Data Science & Analytics) Mini Project Presentation · Batch 2018-2020

Objective of the Project
• The main objective of this work is to understand the concept of association
rule mining and its application in market basket analysis.

• Understanding the key terms associated with this work, such as support,
confidence and lift, and how they matter in the examples we have used.

• The project analyses two real-world datasets for market basket analysis. The
first dataset has been analysed with the Python programming language and
the second with the R programming language.

Outline
• Association Rule Mining
• Market Basket Analysis
• Examples
• Support
• Confidence
• Conviction
• All-confidence
• Lift
• Case Study using Python
• Conclusion
• References

Association Rule Mining
• Association rule mining is a rule-based machine learning method used for
discovering relationships and patterns between items in large datasets.

• For example, association rule mining discovers regularities between products
in large-scale transaction data, such as that recorded by the point-of-sale
systems of supermarkets. This helps extensively in marketing activities such
as product placement and pricing.

• Association rule mining also has other applications, such as web usage
mining, intrusion detection, and bioinformatics.

Association Rule Mining
• Association rules can be thought of as an IF-THEN relationship: if a customer
picks item A, we estimate the chance that the same customer (in the same
transaction) also picks item B.

(If) A → B (Then)

• Of the two elements, A is the 'antecedent' (the IF part) and B is the
'consequent' (the THEN part).

• So if customers who buy bread and butter see a discount or an offer on eggs,
they will be encouraged to spend more and buy the eggs as well. This is what
market basket analysis is all about.

Association Rule Mining
If we consider a store dataset such as:

Transaction ID | Item 1 | Item 2    | Item 3    | Item 4
1              | Bread  | BBQ sauce | -         | Meat
2              | Meat   | Bread     | BBQ sauce | Beer
3              | Meat   | -         | -         | Beer
4              | Beer   | BBQ sauce | Meat      | -
5              | Bread  | Meat      | Beer      | BBQ sauce

we can observe that consumers who buy Meat are more likely to also buy Beer. Thus,

Meat → Beer

Similarly, people who buy Bread and Meat tend to also buy Barbeque sauce and Beer:

IF {Meat, Bread} THEN {Beer, BBQ sauce}
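The rules read off the table can be checked by directly counting transactions. Below is a minimal Python sketch of that counting; the set representation and the `rule_holds` helper are our own illustrative choices, not part of the original slides:

```python
# Store transactions from the table above, written as item sets.
store = [
    {"bread", "bbq sauce", "meat"},
    {"meat", "bread", "bbq sauce", "beer"},
    {"meat", "beer"},
    {"beer", "bbq sauce", "meat"},
    {"bread", "meat", "beer", "bbq sauce"},
]

def rule_holds(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, count how many
    also contain the consequent."""
    with_ante = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_ante if consequent <= t]
    return len(with_both), len(with_ante)

print(rule_holds({"meat"}, {"beer"}, store))                       # (4, 5)
print(rule_holds({"meat", "bread"}, {"beer", "bbq sauce"}, store)) # (2, 3)
```

Four of the five meat transactions also contain beer, which is what motivates the rule Meat → Beer.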
Market Basket Analysis
• Applications of association rule mining are found in marketing, market
basket analysis (also known as basket data analysis) in retailing, clustering,
and classification.

• It finds relationships between the sets of items in each transaction, not
between individual features. Hence, it is different from collaborative
filtering.

• Effectively, market basket analysis estimates the likelihood of various
products occurring together. It helps tell us which items tend to be bought
together, using algorithms such as Apriori, ECLAT, and FP-growth.

Apriori Algorithm
• The Apriori algorithm relies on the property that every subset of a frequent
itemset must itself be frequent. The algorithm is widely used in market
basket analysis (MBA).

• For example, if we have transactions involving the fruits apples, grapes and
mangoes, the Apriori principle says that if {Apples, Grapes, Mangoes} is
frequent, then {Grapes, Mangoes} must also be frequent.

• We have considered a dataset of six transactions, where each transaction is a
combination of 0s and 1s (a one-hot encoded dataset). A 0 indicates the
absence of an item, and a 1 indicates that the item is present.
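The downward-closure property described above can be turned into a small level-wise search. The following is a pure-Python sketch of the idea, not the apyori or mlxtend implementation used later in the case study; the item names come from the six-transaction example:

```python
from itertools import combinations

# The six one-hot transactions from the slide (1 = present), as item sets.
transactions = [
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "dry fruits", "flour"},
    {"dry fruits", "flour"},
    {"cookies"},
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "cookies", "flour"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def apriori(transactions, min_support=0.5):
    """Return all frequent itemsets, level by level, using downward closure:
    a superset can only be frequent if all of its subsets are frequent."""
    items = sorted(set().union(*transactions))
    frequent = {}
    # Level 1: frequent single items.
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        # Join level-k sets into level-(k+1) candidates, keeping only those
        # whose k-item subsets are all already frequent (the pruning step).
        candidates = set()
        for a, b in combinations(level, 2):
            u = a | b
            if len(u) == k + 1 and all(frozenset(c) in frequent
                                       for c in combinations(u, k)):
                candidates.add(u)
        level = [c for c in candidates
                 if support(c, transactions) >= min_support]
        k += 1
    return frequent

freq = apriori(transactions, min_support=0.5)
print(len(freq))  # 11 frequent itemsets at min_support = 0.5
```

Note how {cookies, dry fruits} (support 2/6) never becomes a candidate parent, so no superset containing both items is ever tested; this pruning is what makes Apriori practical on large data.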

Transaction ID | Milk | Cookies | Dry fruits | Flour
1              | 1    | 1       | 1          | 1
2              | 1    | 0       | 1          | 1
3              | 0    | 0       | 1          | 1
4              | 0    | 1       | 0          | 0
5              | 1    | 1       | 1          | 1
6              | 1    | 1       | 0          | 1

We will use this transaction dataset to discuss the MBA metrics:
1. Support
2. Confidence
3. Lift
4. Conviction

Metrics of Association Rules - Support
Support – Support tells us how frequently a particular itemset appears in a
dataset. Thus, if we consider item 'A' as the antecedent and 'B' as the
consequent, then the support of A → B is the ratio of transactions containing
both 'A' and 'B' to the total number of transactions.

Support = (Transactions involving both items) / (Total transactions)

Support(Milk → Cookies) = 3/6 = 0.5

50.0% of the transactions contain both milk and cookies. Thus, support is the
joint probability P(A and B), and it is also known as coverage.
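The support calculation above takes only a few lines of Python. This is an illustrative sketch over the six-transaction table; the set representation is our own choice, not part of the slides:

```python
# The six transactions from the table, written as item sets.
transactions = [
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "dry fruits", "flour"},
    {"dry fruits", "flour"},
    {"cookies"},
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "cookies", "flour"},
]

def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t) / len(transactions)

print(support({"milk", "cookies"}, transactions))  # 3 of 6 transactions -> 0.5
```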

• Support helps us identify the rules worth considering for analysis. The
example we discussed is a hypothetical situation; in reality it is not that
simple. A departmental store, for example, can have 10,000 transactions, in
which case a pair of items that co-occurs only a handful of times has a
support of the order of 0.0005.

• For items with very low support values, we cannot be sure about the
relationship between them due to lack of information, and no conclusion can
be drawn.

• Thus, support alone is often not enough to serve as the metric for
determining association between products.

• For example, a team scoring 5 goals in a match has a very low chance of
losing the match. However, the chance of scoring 5 goals in a match is also
very low.

Metrics of Association Rules - Confidence
Transaction ID | Milk | Cookies | Dry fruits | Flour
1              | 1    | 1       | 1          | 1
2              | 1    | 0       | 1          | 1
3              | 0    | 0       | 1          | 1
4              | 0    | 1       | 0          | 0
5              | 1    | 1       | 1          | 1
6              | 1    | 1       | 0          | 1

Confidence = (Transactions involving both items) / (Transactions containing the antecedent)

• Sticking to the same example, the likelihood that purchasing dry fruits leads
to a purchase of flour is 4/4, i.e., 100%. However, the likelihood that
purchasing flour results in buying dry fruits is 4/5, i.e., 80%.

• Thus, confidence({dry fruits} → {flour}) = 4/4, and
confidence({flour} → {dry fruits}) = 4/5.
Hence, {dry fruits} → {flour} is the more important rule for this example.

• Hence, we can say that if A (antecedent) and B (consequent) are two items,
then confidence is P(A and B) / P(A), i.e., the conditional probability P(B|A).

• However, the antecedent does not matter much when the consequent is very
frequent, as the confidence of an association rule with a very frequent
consequent will always be high.

Suppose 12 transactions contain toothpaste, 60 contain bread, and 10 contain
both.

• Here, intuitively we know that toothpaste and bread have a weak association,
but {toothpaste} → {bread} has a confidence of 10/12 = 0.833, which is quite
high. This may be misleading, and hence we will later discuss lift as a more
robust metric.
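The two confidence values above can be reproduced directly. This sketch is self-contained and repeats the toy transactions; the `confidence` helper is our own illustrative naming:

```python
# The six transactions from the table, written as item sets.
transactions = [
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "dry fruits", "flour"},
    {"dry fruits", "flour"},
    {"cookies"},
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "cookies", "flour"},
]

def confidence(antecedent, consequent, transactions):
    """Confidence(A -> B) = transactions with A and B / transactions with A."""
    with_ante = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_ante if consequent <= t]
    return len(with_both) / len(with_ante)

print(confidence({"dry fruits"}, {"flour"}, transactions))  # 4/4 -> 1.0
print(confidence({"flour"}, {"dry fruits"}, transactions))  # 4/5 -> 0.8
```

Swapping antecedent and consequent changes the denominator, which is why the two directions of the same item pair give different confidence values.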

Metrics of Association Rules - Conviction
Conviction(A → B) = (1 − Support(B)) / (1 − Confidence(A → B))

• Conviction can be interpreted as the ratio of the expected frequency with
which A appears without B (if A and B were independent events) to the
observed frequency of incorrect predictions made by the rule.

• From our previous example, if we try to find conviction({milk} → {cookies}),
then (1 − 4/6) / (1 − 0.75) = 4/3 ≈ 1.33.

• A conviction value of 1.33 means that the rule {milk} → {cookies} would be
wrong about 1.33 times as often if the association between these two items
were purely random in nature.
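The conviction formula plugs together the support and confidence computed earlier. A self-contained sketch on the same toy data (the helper names are illustrative; note that conviction is undefined when confidence is exactly 1):

```python
# The six transactions from the table, written as item sets.
transactions = [
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "dry fruits", "flour"},
    {"dry fruits", "flour"},
    {"cookies"},
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "cookies", "flour"},
]

def support(items, transactions):
    return sum(1 for t in transactions if items <= t) / len(transactions)

def conviction(antecedent, consequent, transactions):
    """Conviction(A -> B) = (1 - support(B)) / (1 - confidence(A -> B)).
    Undefined (division by zero) when the rule's confidence is 1."""
    conf = (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))
    return (1 - support(consequent, transactions)) / (1 - conf)

print(conviction({"milk"}, {"cookies"}, transactions))  # (1 - 4/6)/(1 - 0.75) ~ 1.33
```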

Metrics of Association Rules –
All-confidence

All-confidence(X) = Support(X) / max_{x ∈ X} Support(x)

• max_{x ∈ X} Support(x) is the support of the single item with the highest
support in the itemset X. All-confidence guarantees that every rule that can
be produced from the itemset X has at least the confidence value returned by
this equation.

• All-confidence is null-invariant and ranges over [0, 1]. It can be used
efficiently with most mining algorithms.
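As a concrete check of the bound, the following self-contained sketch computes all-confidence for {milk, cookies} on the toy data (helper names are our own):

```python
# The six transactions from the table, written as item sets.
transactions = [
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "dry fruits", "flour"},
    {"dry fruits", "flour"},
    {"cookies"},
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "cookies", "flour"},
]

def support(items, transactions):
    return sum(1 for t in transactions if items <= t) / len(transactions)

def all_confidence(itemset, transactions):
    """support(X) divided by the largest single-item support within X."""
    top = max(support({x}, transactions) for x in itemset)
    return support(itemset, transactions) / top

# Every rule derivable from {milk, cookies} has at least this confidence.
print(all_confidence({"milk", "cookies"}, transactions))  # 0.5 / (4/6) ~ 0.75
```

Here both {milk} → {cookies} and {cookies} → {milk} indeed have confidence 3/4, matching the bound exactly because milk and cookies happen to have equal support.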

Metrics of Association Rules –
Lift
Transaction ID | Milk | Cookies | Dry fruits | Flour
1              | 1    | 1       | 1          | 1
2              | 1    | 0       | 1          | 1
3              | 0    | 0       | 1          | 1
4              | 0    | 1       | 0          | 0
5              | 1    | 1       | 1          | 1
6              | 1    | 1       | 0          | 1

• If we consider the rule {milk} → {cookies}, we can see that:

Support({milk} → {cookies}) = 3/6 = 0.5
Confidence({milk} → {cookies}) = 3/4 = 0.75

• If we want to analyse whether buying milk actually leads to buying cookies,
the confidence of 0.75 has to be compared against the baseline probability of
buying cookies, which is 4/6 regardless of whether milk is bought.

Metrics of Association Rules –
Lift
Lift(X → Y) = (Transactions containing both X and Y / Transactions containing X)
              / (Fraction of transactions containing Y)
            = Confidence(X → Y) / Support(Y)

• For this case, the lift of {milk} → {cookies} is 0.75 / (4/6) = 1.125.

• A lift value higher than 1 implies that the probability of buying cookies
increases when milk is purchased; in other words, there is a meaningful
association. On the other hand, a lift value lower than 1 implies that the
chance of buying cookies does not increase when milk has been bought.

• Lift values help store managers decide how to place items in the aisles:
items with higher lift values between them are kept in close proximity.
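The lift of 1.125 above follows directly from confidence and support. A self-contained sketch on the toy data (the `lift` helper is our own illustrative naming):

```python
# The six transactions from the table, written as item sets.
transactions = [
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "dry fruits", "flour"},
    {"dry fruits", "flour"},
    {"cookies"},
    {"milk", "cookies", "dry fruits", "flour"},
    {"milk", "cookies", "flour"},
]

def support(items, transactions):
    return sum(1 for t in transactions if items <= t) / len(transactions)

def lift(antecedent, consequent, transactions):
    """Lift(X -> Y) = confidence(X -> Y) / support(Y)."""
    conf = (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))
    return conf / support(consequent, transactions)

print(lift({"milk"}, {"cookies"}, transactions))  # 0.75 over baseline 4/6 -> ~1.125
```

Because the denominator is the consequent's own popularity, lift corrects for the "frequent consequent" problem that confidence alone suffers from in the toothpaste-and-bread example.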

Case Study
• We have conducted a study analysing a real-world dataset using the Python
programming language to find the different associations among products.

• The dataset considered for demonstrating market basket analysis is from a
departmental store (in the US) and consists of transactions over a period of
one month.

• The raw data is in comma-separated values format (store_data.csv), with a
total of 7501 transactions covering 119 unique items sold by the store.

• The most frequent items in the dataset are 'mineral water' (1788), 'eggs'
(1348), 'spaghetti' (1306), 'french fries' (1282), and 'chocolate' (1229).

Case Study
• The minimum number of items purchased in a single transaction is 1, while
the maximum is 20. The item frequency of avocado was found to be the
highest, with a value of 0.0333, and that of baby food the lowest, with a
value of 0.00453.

• We have used the Python programming language to perform the MBA, with
the 'apriori' function of the 'apyori' library. A similar apriori function can
also be imported from a different library called 'mlxtend'. For the plots, we
have used the 'seaborn' and 'matplotlib' libraries.

• The parameters for the apriori function were given as: a minimum support
value of 0.0045 for a rule to be considered, and a minimum confidence value
of 0.2.


Case Study

• The minimum lift value was required to be 3 for any rule to be considered
worth analysing.

• A total of 48 association rules were found by the program after applying the
apriori function to the dataset. The dataset was analysed, and we stored the
metrics of support, confidence and lift in a separate file.

• Among the 48 rules, the maximum support, for {herb, pepper} → {ground
beef}, is 0.016; the maximum confidence, for {cooking oil, ground beef} →
{spaghetti}, is 0.5714; and the maximum lift, for {light cream} → {chicken},
was found to be 4.8439.

Case Study
The top 5 most frequently sold items in the data were:

Item          | Transaction frequency
Mineral water | 1788
Eggs          | 1348
Spaghetti     | 1306
French fries  | 1282
Chocolate     | 1229

Figure 1 shows the top 20 most frequently sold items.

Case Study
• Of the 48 association rules that were formed, Fig. 2 plots the rules by their
support values. {herb, pepper} → {ground beef} can be observed to have the
maximum value of 0.016.

• Fig. 3 plots the confidence values of the association rules. {cooking oil,
ground beef} → {spaghetti} has the maximum value of 0.5714.

Figure 2 and Figure 3
Case Study
• Fig. 4 plots the confidence values of the association rules, sorted in
descending order, to show the confidence of each rule.

• Fig. 5 plots the lift values of the 48 association rules so formed. The
{light cream} → {chicken} rule has the highest value of 4.8439.

Figure 4 and Figure 5
Case Study
• Fig. 6 plots the lift values of the association rules, sorted in descending
order. For our case, lift is the most important metric of association for
market basket analysis.

Figure 6
Case Study
The top 5 rules with the highest lift values (sorted in descending order) are:

Rule                           | Lift   | Confidence
{light cream} → {chicken}      | 4.8439 | 0.2905
{light cream} → {chicken, nan} | 4.8439 | 0.2905
{pasta} → {escalope}           | 4.7008 | 0.3728
{pasta} → {nan, escalope}      | 4.7008 | 0.3728
{pasta} → {nan, shrimp}        | 4.5150 | 0.3220

Since confidence values alone are not a good metric for judging how important
an association is, a combined plot of the values provides a better picture.

Case Study
Lift vs. Confidence

Figure 7 shows the lift vs. confidence plot of the association rules.
Conclusion
• In this project, we have discussed association rule mining and its application
to market basket analysis.

• We have discussed the calculation and importance of various metrics:
support, confidence, lift, all-confidence, and conviction.

• A case study was carried out using the Python programming language to
analyse a departmental store dataset consisting of 7501 records, and the
association rules were found along with their corresponding metrics. We
used the apriori function for the process.

• For better understanding and visualisation, we have plotted the rules and
combined the metrics to infer the best possible rules.
References
• Agrawal, R., Imielinski, T., and Swami, A. (1993). Mining association rules between sets of items
  in large databases. Proceedings of the 1993 ACM SIGMOD International Conference on
  Management of Data, pp. 207-216.
• Srikant, R. and Agrawal, R. (1996). Mining quantitative association rules in large relational
  tables. Proceedings of the 1996 ACM SIGMOD International Conference on Management of
  Data (ACM SIGMOD 1996), pp. 1-12.
• Zaki, M. J. (2000). Scalable algorithms for association mining. IEEE Transactions on Knowledge
  and Data Engineering, Vol 12, No 3, pp. 372-390, 2000.
• Mata, J., Alvarez, J. L., and Riquelme, J. C. (2001). Mining numeric association rules with genetic
  algorithms. Proceedings of the 5th International Conference on Artificial Neural Networks and
  Genetic Algorithms (ICANNGA 2001), pp. 264-267.
• Hong, T., Kuo, C., and Chi, S. (2001). Trade-off between computation and number of rules for
  fuzzy mining from quantitative data. International Journal of Uncertainty, Fuzziness and
  Knowledge-Based Systems, Vol 9, No 5, pp. 587-604, 2001.
• Mata, J., Alvarez, J. L., and Riquelme, J. C. (2002). Discovering numeric association rules via
  evolutionary algorithm. Proceedings of the 6th Pacific-Asia Conference on Advances in
  Knowledge Discovery and Data Mining (PAKDD 2002), pp. 40-51, ISBN: 1-58113-445-2.
• Han, J., Pei, J., Yin, Y., and Mao, R. (2004). Mining frequent patterns without candidate
  generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, Vol 8,
  No 1, pp. 53-87, 2004.
• Scheffer, T. (2005). Finding association rules that trade support optimally against confidence.
  Intelligent Data Analysis, Vol 9, No 4, pp. 381-395, 2005.
• Alatas, B. and Akin, E. (2006). An efficient genetic algorithm for automated mining of both
  positive and negative quantitative association rules. Soft Computing, Vol 10, No 3, pp. 230-237,
  2006.
References
• Yan, X., Zhang, C., and Zhang, S. (2009). Genetic algorithm-based strategy for identifying
  association rules without specifying actual minimum support. Expert Systems with
  Applications, Vol 36, pp. 3066-3076, 2009.
• Alcala-Fdez, J., Alcala, R., Gacto, M. J., and Herrera, F. (2009). Learning the membership function
  contexts for mining fuzzy association rules by using genetic algorithms. Fuzzy Sets and
  Systems, Vol 160, No 7, pp. 905-921, 2009.
• Luna, J. M., Romero, J. R., and Ventura, S. (2012). Design and behavior study of a grammar-guided
  genetic programming algorithm for mining association rules. Knowledge and Information
  Systems, Vol 32, No 1, pp. 53-76, 2012.
• Luna, J. M., Romero, J. R., and Ventura, S. (2013). Grammar-based multi-objective algorithms for
  mining association rules. Data & Knowledge Engineering, Vol 86, pp. 19-37, 2013.
• Luna, J. M., Romero, J. R., and Ventura, S. (2014). On the adaptability of G3PARM to the extraction
  of rare association rules. Knowledge and Information Systems, Vol 38, No 2, pp. 391-418, 2014.
• Zhang, H., Song, W., Liu, L., and Wang, H. (2017). The application of matrix Apriori algorithm in
  web log mining. Proceedings of the 2017 IEEE 2nd International Conference on Big Data
  Analysis (ICBDA), March 2017, Beijing, China.
• Ledolter, J. Data Mining and Business Analytics using R, ISBN: 978-1-118-44714-7, June 2013,
  John Wiley Publication.
• Witten, I. H., Frank, E., and Hall, M. A. Data Mining: Practical Machine Learning Tools and
  Techniques, ISBN: 9780123748560, January 2011, Morgan Kaufmann.
• Han, J., Kamber, M., and Pei, J. Data Mining: Concepts and Techniques, ISBN: 9780123814791,
  June 2011, Morgan Kaufmann.
• Ye, N. Data Mining: Theories, Algorithms, and Examples, ISBN: 9781138073661, April 2017, CRC
  Press.
• Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining, ISBN: 9780262082907, August
  2001, MIT Press.
23-07-2020 MSc (Data Science & Analytics) Mini Project Presentation : Batch 29
2018 - 2020