APRIORI Algorithm

Introduction:

• It uses frequent itemsets to generate association rules, and it works on databases that contain transactions.
• With the help of these association rules, it determines how strongly or weakly two objects are connected.
• This algorithm uses breadth-first search and a Hash Tree to calculate the itemset associations efficiently.
• This process works iteratively to find the frequent itemsets in a large dataset.
Introduction:

• This algorithm was proposed by R. Agrawal and R. Srikant.
• It was developed in the year 1994.
• This algorithm is mainly employed to do market basket analysis.
• It helps to find products that are often bought together.
• It is also used in the medical field to find drug reactions in patients.
Frequent Itemset
• Frequent itemsets are those items whose support is greater than the threshold value, or user-specified minimum support.
• If {A, B} is a frequent itemset, then A and B must individually be frequent itemsets as well.
• For example, consider two transactions: A = {8,9,10,11,12} and B = {10,11,19}.
• In these two transactions, {10, 11} is a frequent itemset.
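As a quick check in Python (a toy snippet mirroring the example above):

A = {8, 9, 10, 11, 12}
B = {10, 11, 19}
# The items common to both transactions form the frequent pair
print(A & B)  # {10, 11}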
Reasons To Use Association Analysis:
• There are numerous ways to analyze the data.
• A variety of supervised and unsupervised machine learning approaches are used to analyze the data.
• The main difficulty with these techniques is that they are difficult to tune, challenging to interpret, and they require a fair amount of data preparation and feature engineering to get good results.
• Association analysis, in contrast, requires only a few math concepts.
Reasons To Use Association Analysis:
• It is an unsupervised learning technique that looks for hidden patterns, so there is a limited need for data preparation and feature engineering.
• Market basket analysis is an application of association analysis.
Association Analysis

• An association rule looks like this: {Diapers} -> {Beer}.
• It says there is a strong relationship between the customers that purchased diapers and the customers that also purchased beer.
• {Diaper} is the antecedent and {Beer} is the consequent.
• There can be multiple items in the antecedent and the consequent, for example:
• {Diaper, Gum} -> {Beer, Chips}
Association Analysis (Terminologies)

• Support: It is the relative frequency with which the rule shows up.
• We look for high support to make sure the relationship is useful.
• Low support can still be useful if we are trying to find "hidden" relationships.
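Formally, in standard notation, for a rule A -> B over N transactions:

\[ \text{support}(A \Rightarrow B) = \frac{\text{freq}(A \cup B)}{N} \]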
Association Analysis (Terminologies)

• Confidence: It is a measure of the reliability of the rule.
• A confidence of 0.5 would mean that in 50% of the cases where Diapers and Gum were purchased, the purchase also included Beer and Chips.
• In product recommendations, a 50% confidence may be acceptable, whereas in the medical field this confidence may not be good enough.
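Formally, confidence is the support of the combined itemset divided by the support of the antecedent:

\[ \text{confidence}(A \Rightarrow B) = \frac{\text{support}(A \cup B)}{\text{support}(A)} \]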
Association Analysis (Terminologies)

• Lift: It is the ratio of the observed support to the support that would be expected if the two itemsets were independent.
• A lift value close to 1 means that the itemsets are effectively independent.
• Lift values > 1 are more useful and indicate a useful rule pattern.
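Formally:

\[ \text{lift}(A \Rightarrow B) = \frac{\text{support}(A \cup B)}{\text{support}(A)\,\text{support}(B)} \]

A value of exactly 1 is what independence predicts, which matches the interpretation above.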
Steps For Apriori Algorithm
• Step 1: Find the support of the itemsets in the transactional database, and select the minimum support and confidence.

• Step 2: Take all the itemsets in the transactions with a support value higher than the selected minimum support.

• Step 3: Find all the rules of these subsets that have a confidence value higher than the threshold, or minimum confidence.
Steps For Apriori Algorithm
• Step 4: Sort the rules in decreasing order of lift.
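The frequent-itemset part of these steps can be written as a minimal Python sketch (an illustration of the level-wise search, not the optimized hash-tree version; min_support here is an absolute count, e.g., 2 as in the worked example that follows):

from itertools import combinations

def apriori_frequent(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    # C1: count each individual item
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    # L1: keep itemsets meeting the minimum support count
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation with the Apriori prune:
        # every (k-1)-subset of a candidate must itself be frequent
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))]
        # Count the support of each candidate in the transaction table
        counts = {c: sum(1 for t in transactions if c <= set(t))
                  for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result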
Apriori Algorithm Working
• We have a dataset that contains various transactions.
• From this dataset, we need to find the frequent itemsets and create association rules using the Apriori algorithm.
Apriori Algorithm Working
• Given: Min Support = 2, Minimum Confidence = 50%.
Solution
• Step 1:
• The first step is to create a table that contains the support count of each itemset, i.e., the frequency of each individual item in the given dataset.
• This table is called the candidate set, or C1.
Solution
• Next, keep only the itemsets whose support count is greater than or equal to the minimum support (2).
• This gives us the table for the frequent itemset L1.
• All the itemsets meet the minimum support except E, so the E itemset is removed.
Solution
• Step 2: Candidate Generation, C2 and L2:
• In this step, we generate C2 with the help of L1.
• In C2, create pairs of the itemsets of L1 in the form of subsets.
• After creating the subsets, again find the support count from the main transaction table, i.e., how many times these pairs have occurred together in the given dataset.
Solution
• Step 2: Candidate Generation, C2 and L2:
• This will give us the table for C2.
Solution
• Again, compare the C2 support counts with the minimum support count; after comparing, the itemsets with a lower support count are eliminated from table C2.
• This gives us the table for L2.
Solution
• Step 3: Candidate Generation, C3 and L3:
• For C3, repeat the same two processes.
• Form the C3 table with subsets of three itemsets together.
• Calculate their support count from the dataset.
Solution
• Next, create the L3 table.
• From the C3 table, we find that there is only one combination of itemsets whose support count is equal to the minimum support count.
• So L3 has only one combination: {A, B, C}.
Solution

• Step 4: Finding the association rules for the subsets:
• To generate the association rules, create a new table with the possible rules from the discovered combination {A, B, C}.
• For all the rules, calculate the confidence using the formula confidence(A -> B) = support(A ∪ B) / support(A).
• After computing the confidence value for all rules, exclude the rules that have less confidence than the minimum threshold (50%). The candidate rules can be enumerated as shown below.
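A small sketch that lists the candidate rules from {A, B, C} (illustrative Python, not part of the original slides); the confidence of each rule would then be computed from the support counts in the transaction table:

from itertools import combinations

itemset = {'A', 'B', 'C'}
# Every non-empty proper subset is a possible antecedent;
# the remaining items form the consequent.
for r in range(1, len(itemset)):
    for antecedent in combinations(sorted(itemset), r):
        consequent = itemset - set(antecedent)
        print(set(antecedent), "->", consequent)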
Solution
• Example Association Rule Table:
Solution

• The minimum threshold or confidence level is 50%, so the first three rules are considered strong association rules for the given problem.
Advantages:

• This is an easy-to-understand algorithm.
• The join and prune steps of the algorithm can be easily implemented on large datasets.
Disadvantages:
• This algorithm is slow when compared to other algorithms.
• The overall performance is reduced because it scans the database multiple times.
• The time complexity and space complexity of the Apriori algorithm are very high.
Python Implementation:
• Many retail shops exist in today's world.
• They try to find associations between the shop's products, so they can say "buy this and get that" to their customers.
• The retailer has a dataset of information.
• It contains a list of transactions made by the customers.
• In this dataset, each row denotes the products purchased by a customer, i.e., one transaction.
Python Implementation:

• The problem is solved using the following steps:
• Pre-processing
• Training the Apriori model
• Data visualization
Python Implementation:
• Pre-Processing Step:
• Importing the Libraries:
• The first step in any model implementation is to import the corresponding module.
• Since the code below uses mlxtend, install it first:

pip install mlxtend
Python Implementation:
• Next, import the essential libraries.

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
Python Implementation:

• Import Step:
• The first step is to import the dataset required for our model.
• Each row of the dataset shows one line of a transaction made by a customer.
• The first row belongs to a transaction made by the first customer.
• Columns such as InvoiceNo, Description, Quantity, and Country hold the individual product details used below.
Python Implementation:

• Import Step:

df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
df.head()
Python Implementation:

• The next step is the data preparation step.
• We need to remove any unwanted null values and unwanted spaces in the dataset.
• Drop the rows that don't have invoice numbers, and remove the credit transactions, i.e., those with invoice numbers that contain the letter C.
Python Implementation:
# Trim stray whitespace from product descriptions
df['Description'] = df['Description'].str.strip()
# Drop rows without an invoice number
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
# Remove credit transactions (invoice numbers containing 'C')
df = df[~df['InvoiceNo'].str.contains('C')]
Python Implementation:
• After the cleanup phase, the next step is to group the items into one transaction per row, with each product one-hot encoded.
• In order to keep the dataset small, look only at the sales data for France.
• Later, the sales data for France will be compared with the sales data for Germany.
Python Implementation:

basket = (df[df['Country'] == "France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
Python Implementation:
• The first few columns of the dataset would look like this:
Python Implementation:
• There are lots of zeros in the data, so any positive value is converted to 1 and anything 0 or less is set to 0.
• This step completes the one-hot encoding of the data; we also remove the POSTAGE column.
Python Implementation:
def encode_units(x):
    # Anything 0 or less means the product was not purchased
    if x <= 0:
        return 0
    # Any positive quantity is treated as a purchase
    if x >= 1:
        return 1

basket_sets = basket.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)
Python Implementation:

• Now that the data is structured properly, we can generate the frequent itemsets that have a support of at least 7%.

frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
Python Implementation:

• The final step is to generate the rules with their corresponding support, confidence and lift.

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()
Python Implementation:
• The frequent itemsets are built using apriori and the rules are built using association_rules.
• There are a few rules with a high lift value.
• It means that the combination occurs more frequently than would be expected given the number of transactions and product combinations.
• Filter the dataframe using standard pandas code.
Python Implementation:
• Look for a large lift (6) and high confidence (0.8).

rules[(rules['lift'] >= 6) &
      (rules['confidence'] >= 0.8)]
Python Implementation:

• When we look at the rules, we can find that green and red alarm clocks are purchased together.
• Red paper cups, napkins and plates are also purchased together.
• The popularity of one product can be used to drive the sales of the other product.
• For example, we can find that we sell 340 Green Alarm Clocks but only 316 Red Alarm Clocks.
Python Implementation:
• So, we can drive more Red Alarm Clock sales through recommendations.

basket['ALARM CLOCK BAKELIKE GREEN'].sum()
340.0

basket['ALARM CLOCK BAKELIKE RED'].sum()
316.0
Python Implementation:
• The combinations vary by country of purchase.

basket2 = (df[df['Country'] == "Germany"]
           .groupby(['InvoiceNo', 'Description'])['Quantity']
           .sum().unstack().reset_index().fillna(0)
           .set_index('InvoiceNo'))
Python Implementation:
basket_sets2 = basket2.applymap(encode_units)
basket_sets2.drop('POSTAGE', inplace=True, axis=1)
frequent_itemsets2 = apriori(basket_sets2, min_support=0.05, use_colnames=True)
rules2 = association_rules(frequent_itemsets2, metric="lift", min_threshold=1)

rules2[(rules2['lift'] >= 4) &
       (rules2['confidence'] >= 0.5)]
Conclusion
• Apriori is an easy-to-run and easy-to-interpret algorithm.
• Without mlxtend, it would be difficult to find these patterns using basic Excel analysis.
• With Python and mlxtend, the analysis is straightforward, and we can access all the additional visualization techniques and data analysis tools in the Python ecosystem.
